Bounds and Estimates on the Average Edit Distance

Schimd, Michele; Bilardi, Gianfranco

doi:10.1007/978-3-030-32686-9_7

Conference paper
First Online: 03 October 2019

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11811)

Abstract

The edit distance is a metric of dissimilarity between strings, widely applied in computational biology, speech recognition, and machine learning. Let \(e_k(n)\) denote the average edit distance between random, independent strings of n characters from an alphabet of a given size k. An open problem is the exact value of \(\alpha _{k}(n)= e_k(n)/n\). While it is known that, for increasing n, \(\alpha _{k}(n)\) approaches a limit \(\alpha _{k}\), the exact value of this limit is unknown, for any \(k\ge 2\). This paper presents an upper bound to \(\alpha _{k}\) based on the exact computation of some \(\alpha _k(n)\) and a lower bound to \(\alpha _{k}\) based on combinatorial arguments on edit scripts. Statistical estimates of \(\alpha _{k}(n)\) are also obtained, with analysis of error and of confidence intervals. The techniques are applied to several alphabet sizes k. In particular, for a binary alphabet, the rigorous bounds are \(0.1742 \le \alpha _2 \le 0.3693\) while the obtained estimate is \(\alpha _2 \approx 0.2888\); for a quaternary alphabet, \(0.3598 \le \alpha _4 \le 0.6318\) and \(\alpha _4 \approx 0.5180\). These values are more accurate than those previously published.
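The quantities defined in the abstract can be made concrete with a minimal Python sketch. It assumes the unit-cost (Levenshtein) edit distance and uniformly random characters; the function names `edit_distance`, `exact_alpha`, and `estimate_alpha` are illustrative and not taken from the paper. `exact_alpha` computes \(\alpha _k(n)\) by brute-force enumeration, which is feasible only for very small n, and `estimate_alpha` returns a plain sample mean with its standard error, which is a simpler error measure than the confidence-interval analysis the paper describes.

```python
import itertools
import random
import statistics


def edit_distance(x, y):
    """Unit-cost edit distance (insert, delete, substitute) via row-wise DP."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j - 1] + (a != b),  # match or substitution
                            prev[j] + 1,             # deletion from x
                            curr[j - 1] + 1))        # insertion into x
        prev = curr
    return prev[-1]


def exact_alpha(k, n):
    """Exact alpha_k(n): average over all ordered pairs of length-n strings."""
    strings = list(itertools.product(range(k), repeat=n))
    total = sum(edit_distance(x, y) for x in strings for y in strings)
    return total / (len(strings) ** 2 * n)


def estimate_alpha(k, n, samples, seed=0):
    """Monte Carlo estimate of alpha_k(n); returns (mean, standard error)."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(samples):
        x = [rng.randrange(k) for _ in range(n)]
        y = [rng.randrange(k) for _ in range(n)]
        ratios.append(edit_distance(x, y) / n)
    mean = statistics.fmean(ratios)
    std_err = statistics.stdev(ratios) / samples ** 0.5
    return mean, std_err
```

For example, `exact_alpha(2, 2)` enumerates the 16 ordered pairs of binary strings of length 2, while `estimate_alpha(2, 200, 100)` samples 100 random pairs of length 200. The enumeration grows as \(k^{2n}\), so larger n calls for sampling or for a more refined exact method than this sketch.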

This work was partially supported by University of Padova projects CPDA152255/15 and CPGA3/13; by MIUR, the Italian Ministry of Education, University and Research, under Grant 20174LF3T8 AHeAD: efficient Algorithms for HArnessing networked Data; and by an IBM SUR Grant.

Notes

1. A similar algorithm computes the length of the LCS. The recurrence (3) becomes \(M_{i,0} = 0\), \(M_{0,j} = 0\), and \(M_{i,j} = \max \{ M_{i-1,j-1} + (1-\xi _{i,j}),\; M_{i-1,j},\; M_{i,j-1} \}\).
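The recurrence in this note can be sketched in Python as follows. This excerpt does not define \(\xi _{i,j}\), so the sketch assumes it is the mismatch indicator (0 when the i-th character of the first string equals the j-th character of the second, 1 otherwise); the function name `lcs_length` is illustrative.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of x and y via the note's recurrence."""
    # m[i][j] is the LCS length of x[:i] and y[:j]; row 0 and column 0 stay 0.
    m = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            # Assumed definition: xi is 0 on a match and 1 on a mismatch.
            xi = 0 if x[i - 1] == y[j - 1] else 1
            m[i][j] = max(m[i - 1][j - 1] + (1 - xi),
                          m[i - 1][j],
                          m[i][j - 1])
    return m[len(x)][len(y)]
```

The table takes time and space proportional to the product of the two string lengths; keeping only the previous row would reduce the space to linear if only the length is needed.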

Author information

Authors and Affiliations

Department of Information Engineering, University of Padova, Padua, Italy
Michele Schimd & Gianfranco Bilardi

Corresponding author

Correspondence to Michele Schimd.

Editor information

Editors and Affiliations

University of A Coruña, A Coruña, Spain
Nieves R. Brisaboa
University of Helsinki, Helsinki, Finland
Simon J. Puglisi

About this paper

Cite this paper

Schimd, M., Bilardi, G. (2019). Bounds and Estimates on the Average Edit Distance. In: Brisaboa, N., Puglisi, S. (eds) String Processing and Information Retrieval. SPIRE 2019. Lecture Notes in Computer Science, vol 11811. Springer, Cham. https://doi.org/10.1007/978-3-030-32686-9_7

DOI: https://doi.org/10.1007/978-3-030-32686-9_7
Published: 03 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32685-2
Online ISBN: 978-3-030-32686-9
eBook Packages: Computer Science, Computer Science (R0)
