Abstract:
Finding approximately conserved sequences, called motifs, across multiple DNA or protein sequences is an important problem in computational biology. In this paper, we con...Show MoreMetadata
Abstract:
Finding approximately conserved sequences, called motifs, across multiple DNA or protein sequences is an important problem in computational biology. In this paper, we consider the (l; d) motif search problem of identifying one or more motifs of length l present in at least q of the n given sequences, with each occurrence differing from the motif in at most d substitutions. The problem is known to be NP-complete, and the largest solved instance reported to date is (26;11). We propose a novel algorithm for the (l; d) motif search problem using streaming execution over a large set of non-deterministic finite automata (NFA). This solution is designed to take advantage of the micron automata processor, a new technology close to deployment that can simultaneously execute multiple NFA in parallel. We demonstrate the capability for solving much larger instances of the (l; d) motif search problem using the resources available within a single automata processor board, by estimating run-times for problem instances (39; 18) and (40; 17). The paper serves as a useful guide to solving problems using this new accelerator technology.
Published in: IEEE/ACM Transactions on Computational Biology and Bioinformatics ( Volume: 13, Issue: 1, 01 Jan.-Feb. 2016)