On Identifying Minimal Absent and Unique Words: An Efficient Scheme

Cognitive Computation

Abstract

Sequence analysis is one of the basic tasks in genomic research. An absent word of a sequence is a word that does not occur in it as a substring. Many studies have looked into finding the shortest absent words, and some recent studies note that longer absent words are also of interest. Simply extending the search beyond the shortest absent words is impractical, as the list tends to grow exponentially in the size of the sequence. A better choice is the set of minimal absent words, since their number is known to grow linearly in the size of the sequence. An absent word is minimal if every one of its proper factors occurs in the sequence. Similarly, a unique word (one that occurs exactly once) is left-fixed minimal unique if none of its proper prefixes is unique. In this paper we present an efficient algorithm that discovers all words up to a user-specified length that are either minimal absent or left-fixed minimal unique in the input sequence. We employ a purely deterministic approach, which guarantees that nothing is overlooked. At each successive iteration, the algorithm works on longer words, using a simple list structure for all operations. Theoretically, the algorithm has a space complexity that is linear in the size of the input sequence, while the time bound scales well with alphabet size. Experimental results on real biological sequences and on randomly generated sequences over alphabets of different sizes show that the running time behaves linearly.
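
To make the two definitions in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm presented in the paper (which is not reproduced on this page) and makes no attempt at linear space or time; it only enumerates candidates and checks the definitions directly. The function names `minimal_absent_words` and `left_fixed_minimal_unique_words`, and the helper `count_occurrences`, are illustrative assumptions.

```python
from itertools import product


def count_occurrences(seq, word):
    """Count (possibly overlapping) occurrences of word in seq."""
    return sum(
        1 for i in range(len(seq) - len(word) + 1) if seq.startswith(word, i)
    )


def minimal_absent_words(seq, alphabet, max_len):
    """Brute-force list of minimal absent words of length <= max_len.

    A word is absent if it does not occur in seq. It is minimal if all of
    its proper factors occur, which holds exactly when its two longest
    proper factors (drop first letter, drop last letter) both occur.
    """
    found = []
    for k in range(1, max_len + 1):
        for letters in product(alphabet, repeat=k):
            w = "".join(letters)
            if w in seq:
                continue  # w occurs, so it is not absent
            if k == 1 or (w[1:] in seq and w[:-1] in seq):
                found.append(w)
    return found


def left_fixed_minimal_unique_words(seq, max_len):
    """Brute-force list of left-fixed minimal unique words of length <= max_len.

    For each start position, the shortest word that occurs exactly once in
    seq is unique while none of its proper prefixes is unique.
    """
    found = []
    for i in range(len(seq)):
        for k in range(1, min(max_len, len(seq) - i) + 1):
            w = seq[i:i + k]
            if count_occurrences(seq, w) == 1:
                found.append(w)
                break  # any longer word starting here has a unique prefix
    return sorted(found)


if __name__ == "__main__":
    print(minimal_absent_words("abaab", "ab", 3))      # ['bb', 'aaa', 'bab']
    print(left_fixed_minimal_unique_words("abaab", 3))  # ['aa', 'aba', 'ba']
```

For the toy sequence `abaab` over the alphabet {a, b}, the word `bb` is absent while its only proper non-empty factor `b` occurs, so it is minimal absent; `abb` is absent but not minimal, because its factor `bb` is already absent. The enumeration over all words of each length is exponential in `max_len`, which is why this sketch is suitable only for checking small examples.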

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments.

Funding

This study was funded by a special fund in the research center of College of Computer and Information Sciences (CCIS) at King Saud University.

Author information

Correspondence to Aqil M. Azmi.

Ethics declarations

Conflict of Interest

Aqil M. Azmi declares that he has no conflict of interest.

Informed Consent

All procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Helsinki Declaration of 1975, as revised in 2008 (5). Additional informed consent was obtained from all patients for whom identifying information is included in this article.

Human and Animal Rights

This article does not contain any studies with human or animal subjects performed by any of the authors.

Cite this article

Azmi, A.M. On Identifying Minimal Absent and Unique Words: An Efficient Scheme. Cogn Comput 8, 603–613 (2016). https://doi.org/10.1007/s12559-016-9385-9

  • DOI: https://doi.org/10.1007/s12559-016-9385-9
