Abstract
Minimizer schemes, or just minimizers, are an important computational primitive for sampling and sketching biological strings. Assuming a fixed alphabet of size \(\sigma \), a minimizer is defined by two integers \(k,w\ge 2\) and a total order \(\rho \) on strings of length k (also called k-mers). A string is processed by a sliding window algorithm that chooses, in each window of length \(w+k-1\), its minimal k-mer with respect to \(\rho \). A key characteristic of a minimizer is its expected density, that is, the expected fraction of chosen k-mers among all k-mers in a random infinite \(\sigma \)-ary string. Random minimizers, in which the order \(\rho \) is chosen uniformly at random, are often used in applications. However, little is known about their expected density \(\mathcal{D}\mathcal{R}_\sigma (k,w)\) besides the fact that it is close to \(\frac{2}{w+1}\) unless \(w\gg k\).
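The sliding window selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `minimizer_positions` and the `rank` callback (standing in for the order \(\rho \)) are hypothetical, and breaking ties in favour of the leftmost occurrence is an assumption, since the abstract does not fix a tie-breaking rule.

```python
def minimizer_positions(s, k, w, rank):
    """Return the sorted start positions of the k-mers chosen in s.

    rank maps a k-mer to a sortable key and plays the role of the
    order rho (hypothetical interface, assumed for illustration).
    """
    n_kmers = len(s) - k + 1
    chosen = set()
    # Each window of length w + k - 1 contains exactly w consecutive k-mers.
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        # min() returns the first minimal element, so ties go to the
        # leftmost k-mer (an assumption, not stated in the abstract).
        best = min(window, key=lambda i: rank(s[i:i + k]))
        chosen.add(best)
    return sorted(chosen)


# Lexicographic order on 2-mers, windows of w = 3 consecutive 2-mers.
print(minimizer_positions("ACGTACGT", 2, 3, lambda kmer: kmer))  # -> [0, 1, 4]
```

Here 3 of the 7 k-mers are chosen. The lexicographic order is used only to keep the example deterministic; a random minimizer would use a uniformly random order instead.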
We first show that \(\mathcal{D}\mathcal{R}_\sigma (k,w)\) can be computed in \(O(k\sigma ^{k+w})\) time. Then we turn to the case \(w\le k\) and present a formula that allows one to compute \(\mathcal{D}\mathcal{R}_\sigma (k,w)\) in just \(O(w\log w)\) time. Further, we describe the behaviour of \(\mathcal{D}\mathcal{R}_\sigma (k,w)\) in this case, establishing the connection between \(\mathcal{D}\mathcal{R}_\sigma (k,w)\), \(\mathcal{D}\mathcal{R}_\sigma (k+1,w)\), and \(\mathcal{D}\mathcal{R}_\sigma (k,w+1)\). In particular, we show that \(\mathcal{D}\mathcal{R}_\sigma (k,w)<\frac{2}{w+1}\) (by a tiny margin) unless w is small. We conclude with some partial results and conjectures for the case \(w>k\).
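As a rough sanity check of the \(\frac{2}{w+1}\) approximation, the density of a random minimizer can be estimated empirically on a finite random string. The sketch below is a Monte Carlo experiment, not the exact \(O(k\sigma ^{k+w})\) or \(O(w\log w)\) computation from the paper. The function name `estimate_density`, its parameters, and the leftmost tie-breaking rule are assumptions made for illustration. The random order is simulated by lazily assigning an independent uniform rank to each distinct k-mer.

```python
import random


def estimate_density(sigma, k, w, length, seed=0):
    """Estimate the density of a random minimizer on one random string.

    Finite-sample estimate only; the expected density is defined over
    a random infinite string and a uniformly random order.
    """
    rng = random.Random(seed)
    s = [rng.randrange(sigma) for _ in range(length)]

    # Lazily draw an i.i.d. uniform rank per distinct k-mer; sorting by
    # these ranks yields a uniformly random order on the k-mers seen.
    ranks = {}
    n_kmers = length - k + 1
    rank_at = []
    for i in range(n_kmers):
        kmer = tuple(s[i:i + k])
        if kmer not in ranks:
            ranks[kmer] = rng.random()
        rank_at.append(ranks[kmer])

    chosen = set()
    for start in range(n_kmers - w + 1):
        # Leftmost minimal k-mer among the w k-mers of this window
        # (tie-breaking rule is an assumption).
        best = min(range(start, start + w), key=rank_at.__getitem__)
        chosen.add(best)
    return len(chosen) / n_kmers


if __name__ == "__main__":
    sigma, k, w = 4, 8, 5
    d = estimate_density(sigma, k, w, length=20000)
    print(f"estimated density: {d:.4f}, 2/(w+1): {2 / (w + 1):.4f}")
```

The estimate fluctuates with the seed and the string length, so it cannot resolve the tiny margin by which the abstract says \(\mathcal{D}\mathcal{R}_\sigma (k,w)\) falls below \(\frac{2}{w+1}\); it only illustrates that the two values are close when \(w\le k\).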
Acknowledgments
S. Golan is supported by Israel Science Foundation grant no. 810/21. A. Shur is supported by the ERC grant MPM no. 683064 under the EU’s Horizon 2020 Research and Innovation Programme and by the State of Israel through the Center for Absorption in Science of the Ministry of Aliyah and Immigration.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Golan, S., Shur, A.M. (2025). Expected Density of Random Minimizers. In: Královič, R., Kůrková, V. (eds) SOFSEM 2025: Theory and Practice of Computer Science. SOFSEM 2025. Lecture Notes in Computer Science, vol 15538. Springer, Cham. https://doi.org/10.1007/978-3-031-82670-2_25
DOI: https://doi.org/10.1007/978-3-031-82670-2_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-82669-6
Online ISBN: 978-3-031-82670-2
eBook Packages: Computer Science, Computer Science (R0)