Abstract
The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let \(e_k(n)\) denote the average edit distance between two independent random strings of n characters over an alphabet of size k. Determining the exact value of \(\alpha _{k}(n)= e_k(n)/n\) is an open problem. While it is known that \(\alpha _{k}(n)\) approaches a limit \(\alpha _{k}\) as n increases, the exact value of this limit is unknown for every \(k\ge 2\). This paper presents an upper bound on \(\alpha _{k}\) based on the exact computation of some \(\alpha _k(n)\) and a lower bound on \(\alpha _{k}\) based on combinatorial arguments on edit scripts. Statistical estimates of \(\alpha _{k}(n)\) are also obtained, together with an analysis of their error and confidence intervals. The techniques are applied to several alphabet sizes k. In particular, for a binary alphabet, the rigorous bounds are \(0.1742 \le \alpha _2 \le 0.3693\) while the obtained estimate is \(\alpha _2 \approx 0.2888\); for a quaternary alphabet, \(0.3598 \le \alpha _4 \le 0.6318\) and \(\alpha _4 \approx 0.5180\). These values are more accurate than those previously published.
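The quantities defined in the abstract can be made concrete with a minimal Python sketch. It assumes the unit-cost (Levenshtein) edit distance, in which insertions, deletions, and substitutions each cost 1; the function names `edit_distance`, `exact_alpha`, and `estimate_alpha` are illustrative and are not taken from the paper. `exact_alpha` enumerates all \(k^{2n}\) pairs of strings and is therefore feasible only for very small n, while `estimate_alpha` averages over random samples and reports a plain standard error, which is only a simple stand-in for the paper's error analysis.

```python
import itertools
import math
import random


def edit_distance(x, y):
    """Unit-cost edit distance via the standard row-by-row dynamic program."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, xi in enumerate(x, start=1):
        curr = [i]  # cost of deleting the first i symbols of x
        for j, yj in enumerate(y, start=1):
            mismatch = 0 if xi == yj else 1
            curr.append(min(prev[j - 1] + mismatch,  # match or substitution
                            prev[j] + 1,             # deletion
                            curr[j - 1] + 1))        # insertion
        prev = curr
    return prev[-1]


def exact_alpha(k, n):
    """Exact alpha_k(n) by brute force over all k^n * k^n string pairs."""
    strings = list(itertools.product(range(k), repeat=n))
    total = sum(edit_distance(x, y) for x in strings for y in strings)
    return total / (len(strings) ** 2 * n)


def estimate_alpha(k, n, samples, seed=0):
    """Monte Carlo estimate of alpha_k(n): (sample mean, standard error)."""
    rng = random.Random(seed)
    values = []
    for _ in range(samples):
        x = [rng.randrange(k) for _ in range(n)]
        y = [rng.randrange(k) for _ in range(n)]
        values.append(edit_distance(x, y) / n)
    mean = sum(values) / samples
    var = sum((v - mean) ** 2 for v in values) / (samples - 1)
    return mean, math.sqrt(var / samples)
```

For example, `exact_alpha(2, 1)` and `exact_alpha(2, 2)` both evaluate to 0.5; the bounds quoted in the abstract concern the limit as n grows, which brute-force enumeration cannot reach.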
This work was partially supported by University of Padova projects CPDA152255/15 and CPGA3/13; by MIUR, the Italian Ministry of Education, University and Research, under Grant 20174LF3T8 AHeAD: efficient Algorithms for HArnessing networked Data; and by an IBM SUR Grant.
Notes
1. A similar algorithm computes the length of the longest common subsequence (LCS). The recurrence (3) becomes \(M_{i,0} = 0\), \(M_{0,j} = 0\), and \(M_{i,j} = \max \{ M_{i-1,j-1} + (1-\xi _{i,j}) ; M_{i-1,j} ; M_{i,j-1} \}\).
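The recurrence in this note can be sketched in Python as follows. The sketch assumes that \(\xi _{i,j}\) is the mismatch indicator (1 if the i-th symbol of the first string differs from the j-th symbol of the second, 0 otherwise); that definition lies outside this excerpt, so it is an assumption here. The function name `lcs_length` is illustrative.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of x and y."""
    n, m = len(x), len(y)
    # M[i][0] = 0 and M[0][j] = 0 are the boundary conditions of the note.
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Assumed convention: xi = 1 on a mismatch, 0 on a match.
            xi = 0 if x[i - 1] == y[j - 1] else 1
            M[i][j] = max(M[i - 1][j - 1] + (1 - xi),  # extend on a match
                          M[i - 1][j],                 # skip a symbol of x
                          M[i][j - 1])                 # skip a symbol of y
    return M[n][m]
```

The full table is kept for clarity; since each row depends only on the previous one, two rows would suffice when memory matters.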
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Schimd, M., Bilardi, G. (2019). Bounds and Estimates on the Average Edit Distance. In: Brisaboa, N., Puglisi, S. (eds) String Processing and Information Retrieval. SPIRE 2019. Lecture Notes in Computer Science, vol 11811. Springer, Cham. https://doi.org/10.1007/978-3-030-32686-9_7
DOI: https://doi.org/10.1007/978-3-030-32686-9_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32685-2
Online ISBN: 978-3-030-32686-9
eBook Packages: Computer Science, Computer Science (R0)