Abstract
Motifs are relatively short sequences that are biologically significant, and their discovery in molecular sequences is a well-researched subject. A don’t care is a special letter that matches every letter in the alphabet. Formally, a motif is a sequence of letters of the alphabet and don’t care letters. A motif \(\tilde{m}_{d,k}\) that contains \(d' \le d\) don’t cares and occurs at least k times in a sequence is maximal if it can be neither extended (to the left or to the right) nor specialised (that is, none of its \(d'\) don’t cares can be replaced with a letter from the alphabet) without reducing its number of occurrences. Here we present a new dynamic data structure, and the first on-line algorithm, to discover all maximal motifs in a sliding window of length \(\ell \) on a sequence x of length n in \(\mathcal {O}(nd\ell + d\lceil \frac{\ell }{w}\rceil \cdot \sum _{i = \ell }^{n-1} |{\textsc {diff}}_{i-1}^{i}|)\) time, where w is the size of the machine word and \({\textsc {diff}}_{i-1}^{i}\) is the symmetric difference of the sets of occurrences of maximal motifs at \(x[i-\ell \mathinner {.\,.}i-1]\) and at \(x[i-\ell +1 \mathinner {.\,.}i]\).
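To make the definitions concrete, the following is a minimal brute-force sketch in Python of matching a motif with don’t cares, collecting its occurrences, and testing whether it can be specialised. It is not the data structure or the on-line algorithm of the paper, and it makes no attempt to reach the stated time bound. The symbol `*` for the don’t care letter and the names `matches_at`, `occurrences`, `can_be_specialised` and `sliding_windows` are assumptions made for illustration only.

```python
DONT_CARE = "*"  # assumed symbol for the don't care letter


def matches_at(motif, text, start):
    """Return True if motif occurs in text at position start (0-based)."""
    if start < 0 or start + len(motif) > len(text):
        return False
    return all(m == DONT_CARE or m == text[start + j]
               for j, m in enumerate(motif))


def occurrences(motif, text):
    """Return the set of starting positions at which motif occurs in text."""
    return {i for i in range(len(text) - len(motif) + 1)
            if matches_at(motif, text, i)}


def can_be_specialised(motif, text, alphabet):
    """Return True if some don't care can be replaced by a letter
    without reducing the number of occurrences (naive check)."""
    base = len(occurrences(motif, text))
    for j, m in enumerate(motif):
        if m != DONT_CARE:
            continue
        for a in alphabet:
            candidate = motif[:j] + a + motif[j + 1:]
            if len(occurrences(candidate, text)) == base:
                return True
    return False


def sliding_windows(x, ell):
    """Yield (start, window) for every window of length ell on x."""
    for i in range(ell, len(x) + 1):
        yield i - ell, x[i - ell:i]


if __name__ == "__main__":
    print(sorted(occurrences("ab*", "abcabdab")))       # [0, 3]
    print(can_be_specialised("ab*", "abcabdab", "abcd"))  # False
    print(can_be_specialised("a*c", "abcabc", "abc"))     # True
    print(list(sliding_windows("abcab", 3)))
```

In the first example, replacing the don’t care of `ab*` with any single letter loses an occurrence, so the motif cannot be specialised; in `abcabc`, the motif `a*c` can be specialised to `abc` while keeping both occurrences, so it is not maximal. A check for left and right extension is omitted here for brevity.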
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Iliopoulos, C.S., Mohamed, M., Pissis, S.P., Vayani, F. (2018). Maximal Motif Discovery in a Sliding Window. In: Gagie, T., Moffat, A., Navarro, G., Cuadros-Vargas, E. (eds) String Processing and Information Retrieval. SPIRE 2018. Lecture Notes in Computer Science, vol 11147. Springer, Cham. https://doi.org/10.1007/978-3-030-00479-8_16
DOI: https://doi.org/10.1007/978-3-030-00479-8_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00478-1
Online ISBN: 978-3-030-00479-8
eBook Packages: Computer Science (R0)