Elsevier

Information Sciences

Volume 237, 10 July 2013, Pages 305-312
An improved voting algorithm for planted (l, d) motif search

https://doi.org/10.1016/j.ins.2013.03.023

Abstract

The planted motif search problem is a classical problem in bioinformatics that seeks to identify meaningful patterns in biological sequences. Because the problem is NP-complete, current algorithms focus on improving the average time complexity and on solving challenging instances within an acceptable time. In this paper, we propose a new exact algorithm, CVoting, that improves on the state-of-the-art Voting algorithm. CVoting uses a new hash technique to reduce the space complexity to O(mn + N(l, d)) and a new pruning technique to reduce the average time complexity to $O\left(m^2 n N(l,d) \left(\frac{1}{4} + \frac{3}{l}\right)^{l}\right)$. Experimental results show that CVoting outperforms competing algorithms, including PMS1, RISOTTO, Voting and Pmsprune, in both space and time: it is up to an order of magnitude faster and uses less memory when solving challenging instances. The software of the proposed algorithm is publicly available at http://staff.ustc.edu.cn/xuyun/motif.

Introduction

The planted (l, d)-motif search is a classical motif discovery problem in bioinformatics, important for identifying meaningful patterns in biological sequences. Motifs are recurring, conserved regions in biological sequences, such as transcription factor binding sites (TFBSs) and splice sites [8]. Since motifs have molecular structural or functional features related to the behavior of DNA, RNA, or proteins [4], [13], [16], [22], [27], identifying them can help us better understand the mechanisms of life.

The formal description of the planted (l, d)-motif search problem is as follows [6], [7], [24]:

Definition 1

Given n sequences $s_1, s_2, \ldots, s_n$ of length m over a finite alphabet Σ = {A, C, G, T} and two integers l and d with $0 \le d < l < m$, the planted (l, d)-motif search problem is to find all strings of length l, called “motifs”, such that each sequence contains a length-l substring whose Hamming distance from the motif is at most d.
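Definition 1 can be checked directly for a single candidate string. The following is a minimal sketch for illustration only; the function names `hamming`, `occurs_within` and `is_motif` are hypothetical and do not come from the paper.

```python
from typing import Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def occurs_within(candidate: str, seq: str, d: int) -> bool:
    """True if some length-l substring of seq is within distance d of candidate."""
    l = len(candidate)
    return any(
        hamming(candidate, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )


def is_motif(candidate: str, seqs: Sequence[str], d: int) -> bool:
    """True if candidate satisfies Definition 1 for every input sequence."""
    return all(occurs_within(candidate, s, d) for s in seqs)
```

Testing one candidate costs O(nml) time, so the difficulty of the problem lies in the number of candidates (up to 4^l), not in verifying any single one.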

Planted (l, d)-motif search algorithms can be divided into two categories: approximate and exact algorithms. A motif search algorithm is approximate if it does not guarantee finding all solutions (or the optimal solutions). Some approximate algorithms, such as WEEDER [19], VINE [15], Pattern Branching [23] and Random Projection [25], apply greedy or heuristic search techniques to speed up the execution time. Others apply statistical techniques, including Expectation Maximization (EM) [2], [18], Gibbs Sampling [17] and hidden Markov models [31].

In this paper, we focus on exact algorithms, which guarantee solution quality. Since the planted (l, d)-motif search problem has been proved to be NP-complete [11], no polynomial-time algorithm exists unless P = NP. In practice, efficient exact algorithms are developed to achieve lower average time complexity and to increase the size of the instances that can be solved. Within the community, some relatively large instances have been proposed as so-called challenging instances. The first challenging instance, proposed by Pevzner and Sze [20], was to find a (15, 4)-motif in 20 random sequences of length 600. When solving this instance, Buhler and Tompa [25] found that, for randomly generated sequences with n = 20 and m = 600 (where n is the number of sequences and m is their length), the expected number of (l, d)-motifs occurring by chance exceeds 1 for some (l, d). Davila et al. [7] proposed to include the (11, 3), (13, 4), (15, 5), (17, 6) and (19, 7) random problems as challenging instances.
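To see why some (l, d) pairs are considered challenging, the expected number of spurious motifs in random data can be estimated. The sketch below assumes i.i.d. uniformly random bases and treats occurrences at different positions as independent, which is the usual approximation for this estimate; the function names are hypothetical.

```python
from math import comb


def neighbor_probability(l: int, d: int) -> float:
    """Probability that a uniformly random l-mer lies within Hamming
    distance d of a fixed l-mer (4-letter alphabet)."""
    return sum(comb(l, i) * 3**i for i in range(d + 1)) / 4**l


def expected_random_motifs(l: int, d: int, n: int = 20, m: int = 600) -> float:
    """Approximate expected number of l-mers that occur, with at most d
    mismatches, in each of n random sequences of length m."""
    p = neighbor_probability(l, d)
    # Chance that a fixed l-mer has at least one d-neighbor in one sequence,
    # assuming the m - l + 1 windows behave independently.
    q = 1.0 - (1.0 - p) ** (m - l + 1)
    return 4**l * q**n


if __name__ == "__main__":
    for l, d in [(15, 4), (11, 3), (13, 4), (15, 5), (17, 6), (19, 7)]:
        print((l, d), expected_random_motifs(l, d))
```

Under these assumptions the estimate is far below 1 for (15, 4) but above 1 for (15, 5), so a planted (15, 5)-motif has to be distinguished from motifs that arise by chance.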

Exact algorithms adopt various techniques to enumerate potential motifs through analyzing patterns in the sequences and using different pruning strategies to reduce the search space. The state-of-the-art algorithms include MITRA [10], PMS1 [24], CENSUS [12], RISOTTO [21], SPELLER [26], Voting [6] and Pmsprune [7]. They organize the search space into a graph, tree, or hash table.

Graph-based methods construct a graph in which each length-l substring of each sequence is a vertex, and two vertices are linked by an edge if and only if the two substrings come from different sequences and their Hamming distance is at most 2d. In this graph, a motif corresponds to a clique of size n. RecMotif [29] applies the concepts of reference sequence and reference vertex to find such cliques. Pevzner and Sze proposed the WINNOWER algorithm [20], which eliminates spurious edges and has time complexity $O((mn)^{2d+1})$. MITRA [10] improves the efficiency of edge pruning in WINNOWER by using mismatch trees. However, neither WINNOWER nor MITRA can solve (15, 5)-motif instances in a reasonable amount of time.
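The graph construction can be sketched as follows. This is an illustrative brute-force version, not the WINNOWER or RecMotif implementation; it uses the threshold "at most 2d", because two substrings that are each within distance d of a common motif can differ in up to 2d positions. The names `build_graph` and `hamming` are hypothetical.

```python
from itertools import combinations
from typing import List, Set, Tuple

Vertex = Tuple[int, int]  # (sequence index, start position)


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def build_graph(seqs: List[str], l: int, d: int) -> Tuple[List[Vertex], Set[Tuple[Vertex, Vertex]]]:
    """One vertex per l-mer; an edge joins l-mers from different
    sequences whose Hamming distance is at most 2d."""
    vertices = [
        (i, j) for i, s in enumerate(seqs) for j in range(len(s) - l + 1)
    ]
    edges = set()
    for (i, j), (p, q) in combinations(vertices, 2):
        if i != p and hamming(seqs[i][j:j + l], seqs[p][q:q + l]) <= 2 * d:
            edges.add(((i, j), (p, q)))
    return vertices, edges
```

A planted motif then appears as a clique with one vertex per sequence, which is why these methods spend most of their effort removing edges that cannot belong to such a clique.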

Tree-based methods, including SPELLER [26] and RISOTTO [21], are based on suffix trees and apply efficient techniques to prune prefixes. The time complexity and space complexity of both SPELLER and RISOTTO are $O(n^2 m N(l,d))$ and $O(n^2 m)$, respectively, where $N(l,d) = \sum_{i=0}^{d} \binom{l}{i} 3^i$, although RISOTTO is shown to run faster in simulations. CENSUS [12] uses lexicographic trees to prune spurious prefixes; its time complexity and space complexity are O(lmnN(l, d)) and O(lmn), respectively. TreeMotif [30] uses a new deterministic tree structure to discover motifs with time complexity $O(n m^4 p^2)$, where $p = \sum_{i=0}^{2d} \binom{l}{i} (3/4)^i (1/4)^{l-i}$. Rajasekaran et al. proposed the PMS1 algorithm [24], based on a binomial-tree-like data structure. The time complexity and space complexity of PMS1 are O(mnN(l, d)) and O(mN(l, d)), respectively. Recently, Davila et al. proposed the Pmsprune algorithm [7] by extending PMS1. Although its worst-case time complexity is $O(m^2 n N(l,d))$, Pmsprune works well in practice because many impossible prefixes can be pruned. The space complexity of Pmsprune is $O(m^2 n)$. As far as we know, Pmsprune is the first exact motif search algorithm capable of solving (19, 7) instances.

Hash-table-based methods use hashing techniques to store all potential motifs instead of pruning them. Ho et al. presented the iTriplet algorithm [14], which generates triplets and keeps the putative motifs of each triplet in a hash table; its time complexity and space complexity are $O(m^3 n p l^3 d^2)$ and O(N(l, d)), respectively. The Voting algorithm proposed by Chin and Leung [6] has time complexity O(mnN(l, d)), which appears to be the best among all existing algorithms. However, its space complexity is O(mN(l, d)), which becomes too high when l is larger than 15.
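The basic voting idea can be sketched as follows: every length-l substring gives one vote to each of its d-neighbors, each sequence votes at most once per candidate, and candidates that collect n votes are motifs. This is a simplified illustration with hypothetical names; it uses Python sets and dictionaries in place of the hash tables of the Voting algorithm [6], and it includes none of the CVoting improvements.

```python
from collections import defaultdict
from itertools import combinations, product
from typing import List, Set


def d_neighbors(x: str, d: int, alphabet: str = "ACGT") -> Set[str]:
    """All strings within Hamming distance d of x."""
    result = set()
    for k in range(d + 1):
        # Choose exactly k positions to change.
        for positions in combinations(range(len(x)), k):
            choices = [[c for c in alphabet if c != x[p]] for p in positions]
            for replacement in product(*choices):
                y = list(x)
                for p, c in zip(positions, replacement):
                    y[p] = c
                result.add("".join(y))
    return result


def voting(seqs: List[str], l: int, d: int) -> List[str]:
    """Return all (l, d)-motifs of seqs by counting votes."""
    votes = defaultdict(int)
    for s in seqs:
        voted = set()  # candidates this sequence has already voted for
        for j in range(len(s) - l + 1):
            voted |= d_neighbors(s[j:j + l], d)
        for y in voted:
            votes[y] += 1
    return sorted(y for y, v in votes.items() if v == len(seqs))
```

For example, `voting(["ACGT", "TACGTT", "ACGTA"], 4, 0)` returns `["ACGT"]`. The per-sequence set `voted` can hold up to (m − l + 1)·N(l, d) strings, which illustrates why the memory footprint of this approach grows quickly with l.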

Other motif discovery algorithms can be found in [1], [3], [5], [9], [32]. In this paper, we propose an exact algorithm, CVoting, that improves on Voting in both time and space complexity for large problems. CVoting uses a new hash technique that has no collisions when storing the d-neighbors of a single length-l string, so that the space complexity is reduced to O(mn + N(l, d)). In addition, it applies a new pruning technique to prune candidate length-l strings. Its average time complexity is $O\left(m^2 n N(l,d) \left(\frac{1}{4} + \frac{3}{l}\right)^{l}\right)$, which is lower than that of the Voting algorithm when $m \left(\frac{1}{4} + \frac{3}{l}\right)^{l} < 1$. Empirically, CVoting is faster than all existing algorithms. It solves (19, 7) challenging instances in about one hour, whereas Pmsprune needs more than 11 h.

CVoting – an improved algorithm based on voting

First, a neighborhood definition between two strings is introduced in this section:

Definition 2

Given two strings x and y of length l, if the Hamming distance between x and y is no larger than d, then y is called a d-neighbor of x (and x is also a d-neighbor of y). Let $N_d(x) = \{y \mid y \text{ is a } d\text{-neighbor of } x\}$; then $|N_d(x)| = N(l,d) = \sum_{i=0}^{d} \binom{l}{i} 3^i$.

The d-neighbor relationship is reflexive and symmetric, yet not transitive.
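The size formula and the failure of transitivity can both be checked on small examples. The sketch below enumerates a neighborhood by brute force over all 4^l strings, so it is only meant for short strings; the function names are hypothetical.

```python
from itertools import product
from math import comb
from typing import Set


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def neighborhood_size(l: int, d: int) -> int:
    """N(l, d): choose i mismatch positions, 3 alternative letters each."""
    return sum(comb(l, i) * 3**i for i in range(d + 1))


def neighborhood(x: str, d: int, alphabet: str = "ACGT") -> Set[str]:
    """N_d(x) by exhaustive enumeration of all strings of length len(x)."""
    return {
        "".join(t)
        for t in product(alphabet, repeat=len(x))
        if hamming(x, "".join(t)) <= d
    }


# The enumerated neighborhood has exactly N(l, d) elements.
assert len(neighborhood("ACGTAC", 2)) == neighborhood_size(6, 2)

# Not transitive: x ~ y and y ~ z for d = 1, but x and z differ in 2 positions.
x, y, z = "AAAA", "AAAT", "AATT"
assert hamming(x, y) <= 1 and hamming(y, z) <= 1 and hamming(x, z) > 1
```

Because the relation is not transitive, two substrings that share a d-neighbor need not be d-neighbors of each other; they are only guaranteed to be within distance 2d.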

CVoting is developed based on the Voting algorithm [6], which works as follows: the d-neighbors of each

Experimental results

In this section, CVoting is compared with four state-of-the-art exact algorithms (Voting, Pmsprune, PMS1 and RISOTTO) in solving challenging instances. All experiments were conducted on a Linux server with Core i5-2400 3.1 GHz CPU and 6 GB RAM. For each experiment, we generated 10 random datasets. The results were obtained through 10 runs.

Similar to the simulations presented in [6], [7], we first tested instances of 20 DNA sequences of length 600 bp in three different cases: randomly generated

Summary

In this paper, a new exact algorithm, CVoting, is proposed for solving planted (l, d)-motif search problems. The algorithm has attractive theoretical properties: low space complexity and low average time complexity. In simulations, CVoting significantly outperforms four of the best existing algorithms (PMS1, RISOTTO, Voting and Pmsprune) on challenging benchmark instances. The software implementing the proposed algorithm is publicly available at http://staff.ustc.edu.cn/xuyun/motif.

Acknowledgements

We thank Linbin Yu, Yiming Lei and Mingzhi Shao for their helpful suggestions for our article. This work is supported in part by the National Natural Science Foundation of China (Nos. 61033009 and 60970085).

References (32)

  • J. Davila et al. Fast and practical algorithms for planted (l,d) motif search. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2007)
  • P. D’haeseleer. What are DNA sequence motifs? Nature Biotechnology (2006)
  • H. Dinh et al. PMS5: an efficient exact algorithm for the (l,d)-motif finding problem. BMC Bioinformatics (2011)
  • E. Eskin et al. Finding composite regulatory patterns in DNA sequences. Bioinformatics (2002)
  • P.A. Evans et al. Toward optimal motif enumeration. Algorithms and Data Structures (2003)
  • E.S. Ho et al. iTriplet, a rule-based nucleic acid sequence motif finder. Algorithms for Molecular Biology (2009)