Expected Density of Random Minimizers

  • Conference paper
SOFSEM 2025: Theory and Practice of Computer Science (SOFSEM 2025)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 15538))

Abstract

Minimizer schemes, or just minimizers, are a very important computational primitive in sampling and sketching biological strings. Assuming a fixed alphabet of size \(\sigma \), a minimizer is defined by two integers \(k,w\ge 2\) and a total order \(\rho \) on strings of length k (also called k-mers). A string is processed by a sliding window algorithm that chooses, in each window of length \(w+k-1\), its minimal k-mer with respect to \(\rho \). A key characteristic of the minimizer is the expected density of chosen k-mers among all k-mers in a random infinite \(\sigma \)-ary string. Random minimizers, in which the order \(\rho \) is chosen uniformly at random, are often used in applications. However, little is known about their expected density \(\mathcal{D}\mathcal{R}_\sigma (k,w)\) besides the fact that it is close to \(\frac{2}{w+1}\) unless \(w\gg k\).

We first show that \(\mathcal{D}\mathcal{R}_\sigma (k,w)\) can be computed in \(O(k\sigma ^{k+w})\) time. Then we attend to the case \(w\le k\) and present a formula that allows one to compute \(\mathcal{D}\mathcal{R}_\sigma (k,w)\) in just \(O(w\log w)\) time. Further, we describe the behaviour of \(\mathcal{D}\mathcal{R}_\sigma (k,w)\) in this case, establishing the connection between \(\mathcal{D}\mathcal{R}_\sigma (k,w)\), \(\mathcal{D}\mathcal{R}_\sigma (k+1,w)\), and \(\mathcal{D}\mathcal{R}_\sigma (k,w+1)\). In particular, we show that \(\mathcal{D}\mathcal{R}_\sigma (k,w)<\frac{2}{w+1}\) (by a tiny margin) unless w is small. We conclude with some partial results and conjectures for the case \(w>k\).
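
As a rough illustration of the definitions in the abstract, the following Python sketch estimates the density of a random minimizer empirically on a finite random string. It is not the paper's method for computing \(\mathcal{D}\mathcal{R}_\sigma (k,w)\) exactly. The function name `random_minimizer_density`, the lazily drawn random ranks standing in for the order \(\rho \), and the leftmost tie-breaking rule for repeated k-mers inside a window are all assumptions made for this example.

```python
import random


def random_minimizer_density(sigma, k, w, n, seed=0):
    """Estimate the density of a random minimizer on a random string.

    Hypothetical helper (not from the paper): draws a random sigma-ary
    string of length n, picks a random order on the k-mers that occur,
    and returns (#chosen k-mer positions) / (#k-mer positions).
    """
    rng = random.Random(seed)
    text = [rng.randrange(sigma) for _ in range(n)]

    # Assumption: i.i.d. uniform ranks, drawn lazily per distinct k-mer,
    # induce a uniformly random total order on the k-mers that appear.
    rank = {}

    def rho(i):
        kmer = tuple(text[i:i + k])
        if kmer not in rank:
            rank[kmer] = rng.random()
        return rank[kmer]

    num_kmers = n - k + 1
    ranks = [rho(i) for i in range(num_kmers)]

    chosen = set()
    # A window of length w + k - 1 contains exactly w consecutive k-mers.
    for start in range(num_kmers - w + 1):
        # Assumption: equal k-mers in one window are resolved leftmost-first.
        best = min(range(start, start + w), key=lambda i: (ranks[i], i))
        chosen.add(best)

    return len(chosen) / num_kmers


if __name__ == "__main__":
    sigma, k, w = 4, 8, 5
    estimate = random_minimizer_density(sigma, k, w, n=20000, seed=1)
    print(f"empirical density: {estimate:.4f}, 2/(w+1) = {2 / (w + 1):.4f}")
```

For parameters with \(w\le k\), such as the ones used in the `__main__` block, the estimate should land near \(\frac{2}{w+1}\), up to sampling noise and boundary effects of the finite string. The sketch runs in O(nw) time and is far too noisy to reveal the tiny margin mentioned in the abstract.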

Acknowledgments

S. Golan is supported by Israel Science Foundation grant no. 810/21. A. Shur is supported by the ERC grant MPM no. 683064 under the EU’s Horizon 2020 Research and Innovation Programme and by the State of Israel through the Center for Absorption in Science of the Ministry of Aliyah and Immigration.

Corresponding author

Correspondence to Arseny M. Shur.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Golan, S., Shur, A.M. (2025). Expected Density of Random Minimizers. In: Královič, R., Kůrková, V. (eds) SOFSEM 2025: Theory and Practice of Computer Science. SOFSEM 2025. Lecture Notes in Computer Science, vol 15538. Springer, Cham. https://doi.org/10.1007/978-3-031-82670-2_25

  • DOI: https://doi.org/10.1007/978-3-031-82670-2_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-82669-6

  • Online ISBN: 978-3-031-82670-2

  • eBook Packages: Computer Science, Computer Science (R0)
