Abstract
Sequence analysis is one of the basic tasks in genomic research. An absent word in a sequence is a string that does not occur as a substring of that sequence. Many studies have focused on finding the shortest absent words, and some recent studies note that longer absent words are also of interest. Simply extending the search beyond the shortest ones is impractical, as the list of absent words tends to grow exponentially in the size of the sequence. A better choice is the minimal absent words, whose number is known to grow linearly in the size of the sequence. An absent word is minimal if all of its proper factors occur in the sequence. Similarly, a unique word is (left-fixed) minimal unique if none of its proper prefixes is unique. In this paper we present an efficient algorithm that discovers all words, up to a user-specified length, that are either minimal absent or left-fixed minimal unique in the input sequence. We employ a purely deterministic approach, which guarantees that nothing is overlooked. At each successive iteration, the algorithm works on longer words, using a simple list structure for all operations. Theoretically, the algorithm has a space complexity that is linear in the size of the input sequence, while the time bound scales well with alphabet size. Experimental results on real biological sequences and on randomly generated ones over different-sized alphabets show that the running time of the algorithm behaves linearly.
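To make the two definitions concrete, the following is a minimal brute-force sketch in Python. It is not the list-based algorithm of the paper, and its running time is far from linear. It only enumerates, for a short sequence, the words that satisfy each definition. The function names (`substring_counts`, `minimal_absent_words`, `left_fixed_minimal_unique_words`) and the example sequence are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def substring_counts(seq, max_len):
    """Count occurrences of every substring of seq with length 1..max_len."""
    counts = Counter()
    n = len(seq)
    for i in range(n):
        for k in range(1, min(max_len, n - i) + 1):
            counts[seq[i:i + k]] += 1
    return counts


def minimal_absent_words(seq, alphabet, max_len):
    """Brute-force minimal absent words of length <= max_len.

    A word of length >= 2 is minimal absent when it is absent while its
    longest proper prefix and longest proper suffix both occur; every
    other proper factor is a factor of one of those two.
    """
    present = set(substring_counts(seq, max_len))
    # A single letter that never occurs is minimal absent by definition.
    found = {a for a in alphabet if a not in present}
    for suffix in present:
        if len(suffix) >= max_len:
            continue
        for a in alphabet:
            cand = a + suffix  # the longest proper suffix occurs
            if cand not in present and cand[:-1] in present:
                found.add(cand)
    return sorted(found, key=lambda w: (len(w), w))


def left_fixed_minimal_unique_words(seq, max_len):
    """Brute-force unique words (length <= max_len) with no unique proper prefix."""
    counts = substring_counts(seq, max_len)
    found = set()
    for word, c in counts.items():
        # Prefix counts never increase with length, so checking the
        # longest proper prefix is enough.
        if c == 1 and (len(word) == 1 or counts[word[:-1]] > 1):
            found.add(word)
    return sorted(found, key=lambda w: (len(w), w))


if __name__ == "__main__":
    print(minimal_absent_words("abaab", "ab", 4))
    print(left_fixed_minimal_unique_words("abaab", 4))
```

For the toy sequence `abaab` over the alphabet `{a, b}` with a length bound of 4, the sketch reports `bb`, `aaa`, `bab` and `aaba` as minimal absent, and `aa`, `ba` and `aba` as left-fixed minimal unique.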






Acknowledgments
We would like to thank the anonymous reviewers for their helpful comments.
Funding
This study was funded by a special fund in the research center of College of Computer and Information Sciences (CCIS) at King Saud University.
Ethics declarations
Conflict of Interest
Aqil M. Azmi declares that he has no conflict of interest.
Informed Consent
All procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Helsinki Declaration of 1975, as revised in 2008 (5). Additional informed consent was obtained from all patients for whom identifying information is included in this article.
Human and Animal Rights
This article does not contain any studies with human or animal subjects performed by any of the authors.
Cite this article
Azmi, A.M. On Identifying Minimal Absent and Unique Words: An Efficient Scheme. Cogn Comput 8, 603–613 (2016). https://doi.org/10.1007/s12559-016-9385-9
DOI: https://doi.org/10.1007/s12559-016-9385-9