On Identifying Minimal Absent and Unique Words: An Efficient Scheme

Azmi, Aqil M.

doi:10.1007/s12559-016-9385-9

Published: 23 February 2016

Cognitive Computation, volume 8, pages 603–613 (2016)

Abstract

One of the basic tasks in genomic research is the analysis of a sequence. An absent word of a sequence is a word that does not occur in it as a substring. Many studies have looked into finding the shortest absent words, and some recent studies note that longer absent words are also of interest. Simply extending the search beyond the shortest ones is impractical, as the list tends to grow exponentially in the size of the sequence. A better choice is the minimal absent words, since their number is known to grow linearly in the size of the sequence. An absent word is minimal if all of its proper factors occur in the sequence. Similarly, a unique word is (left-fixed) minimal unique if none of its proper prefixes is unique. In this paper, we present an efficient algorithm that discovers all words up to a user-specified length that are either minimal absent or left-fixed minimal unique in the input sequence. We employ a purely deterministic approach, which guarantees that nothing is overlooked. At each successive iteration, the algorithm works on longer words, using a simple list structure for all operations. Theoretically, the algorithm has a space complexity that is linear in the size of the input sequence, while the time bound scales well with alphabet size. Experimental results on real biological sequences and on randomly generated ones over different-sized alphabets show that the running time of the algorithm behaves linearly.
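
To make the two definitions concrete, the following is a brute-force sketch in Python. It is not the list-based algorithm of the paper, whose details are not given in this abstract; it only enumerates words straight from the definitions. The function names and the `max_len` parameter are hypothetical. It also assumes that a word is "unique" when it occurs exactly once in the sequence, and that the empty prefix is never unique.

```python
from collections import Counter


def factor_counts(s, max_len):
    """Count occurrences (overlaps included) of every factor of s up to max_len."""
    counts = Counter()
    for k in range(1, max_len + 1):
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def minimal_absent_words(s, alphabet, max_len):
    """Words of length <= max_len that are absent from s while all of their
    proper factors occur in s. Checking the two longest proper factors
    (w[:-1] and w[1:]) suffices, since every proper factor lies in one of them."""
    counts = factor_counts(s, max_len)
    # Length 1: letters of the alphabet that never appear in s.
    result = {a for a in alphabet if a not in counts}
    # Length >= 2: extend each present factor u by one letter.
    for u in list(counts):
        if len(u) >= max_len:
            continue
        for a in alphabet:
            w = u + a
            if w not in counts and w[1:] in counts:
                result.add(w)
    return result


def left_fixed_minimal_unique(s, max_len):
    """Words of length <= max_len occurring exactly once in s whose proper
    prefixes all occur more than once (assumed reading of the definition)."""
    counts = factor_counts(s, max_len)
    return {
        w for w, c in counts.items()
        if c == 1 and (len(w) == 1 or counts[w[:-1]] > 1)
    }


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab", "ab", 4)))
    print(sorted(left_fixed_minimal_unique("abaab", 4)))
```

For the sequence `abaab` over the alphabet {a, b}, the word `bb` is minimal absent because it does not occur while its only proper factor `b` does; `abb` is absent but not minimal, since its factor `bb` is already absent. This sketch uses memory proportional to the number of distinct factors, so it is suitable only for small illustrative inputs.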


Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments.

Funding

This study was funded by a special fund in the research center of College of Computer and Information Sciences (CCIS) at King Saud University.

Author information

Authors and Affiliations

Department of Computer Science, College of Computer and Information Sciences, King Saud University, Riyadh, 11543, Saudi Arabia
Aqil M. Azmi

Corresponding author

Correspondence to Aqil M. Azmi.

Ethics declarations

Conflict of Interest

Aqil M. Azmi declares that he has no conflict of interest.

Informed Consent

All procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Helsinki Declaration of 1975, as revised in 2008 (5). Additional informed consent was obtained from all patients for whom identifying information is included in this article.

Human and Animal Rights

This article does not contain any studies with human or animal subjects performed by any of the authors.

About this article

Cite this article

Azmi, A.M. On Identifying Minimal Absent and Unique Words: An Efficient Scheme. Cogn Comput 8, 603–613 (2016). https://doi.org/10.1007/s12559-016-9385-9

Received: 23 February 2015
Accepted: 05 February 2016
Published: 23 February 2016
Issue Date: August 2016
DOI: https://doi.org/10.1007/s12559-016-9385-9
