Abstract
Many real-world applications, such as spell-checking or DNA analysis, use the Levenshtein edit distance to compute similarities between strings. In practice, the costs of the primitive edit operations (insertion, deletion and substitution of symbols) are generally hand-tuned. In this paper, we propose an algorithm to learn these costs. The underlying model is a probabilistic transducer, computed using grammatical inference techniques, which allows us to learn both the structure and the probabilities of the model. The learned transducers are neither deterministic nor stochastic in the standard terminology; they are conditional, and thus independent of the distribution of the input strings. Finally, we show through experiments that our method allows us to design cost functions that depend on the string context in which the edit operations are used. In other words, we obtain a form of context-sensitive edit distance.
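To make the role of the edit costs concrete, here is a minimal sketch of the classical dynamic-programming edit distance with pluggable cost functions. It is not the learning algorithm or the conditional transducer proposed in the paper; it only illustrates the hand-tuned setting that the paper aims to replace. The function name `weighted_edit_distance`, the parameters `ins_cost`, `del_cost` and `sub_cost`, and the vowel-based cost `vowel_aware_sub` are assumptions made for this illustration.

```python
from typing import Callable


def weighted_edit_distance(
    x: str,
    y: str,
    ins_cost: Callable[[str], float] = lambda b: 1.0,
    del_cost: Callable[[str], float] = lambda a: 1.0,
    sub_cost: Callable[[str, str], float] = lambda a, b: 0.0 if a == b else 1.0,
) -> float:
    """Cheapest cost of turning x into y with the given edit costs.

    With the default unit costs this is the Levenshtein distance.
    """
    n, m = len(x), len(y)
    # d[i][j] holds the cheapest cost of turning x[:i] into y[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost(x[i - 1])
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost(y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + del_cost(x[i - 1]),               # delete x[i-1]
                d[i][j - 1] + ins_cost(y[j - 1]),               # insert y[j-1]
                d[i - 1][j - 1] + sub_cost(x[i - 1], y[j - 1]),  # substitute or match
            )
    return d[n][m]


def vowel_aware_sub(a: str, b: str) -> float:
    """Hypothetical hand-tuned cost: swapping one vowel for another is cheaper."""
    if a == b:
        return 0.0
    if a in "aeiou" and b in "aeiou":
        return 0.5
    return 1.0


levenshtein = weighted_edit_distance("kitten", "sitting")
hand_tuned = weighted_edit_distance("kitten", "sitting", sub_cost=vowel_aware_sub)
```

In this sketch each cost depends only on the symbols involved in the operation, so it cannot express the context-sensitive costs described in the abstract; those would require the cost to depend on where in the string the operation is applied.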
This work was supported in part by the IST Programme of the European Community, under the Pascal Network of Excellence, IST-2002-506778. This publication only reflects the authors’ views.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bernard, M., Janodet, JC., Sebban, M. (2006). A Discriminative Model of Stochastic Edit Distance in the Form of a Conditional Transducer. In: Sakakibara, Y., Kobayashi, S., Sato, K., Nishino, T., Tomita, E. (eds) Grammatical Inference: Algorithms and Applications. ICGI 2006. Lecture Notes in Computer Science, vol 4201. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11872436_20
DOI: https://doi.org/10.1007/11872436_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45264-5
Online ISBN: 978-3-540-45265-2
eBook Packages: Computer Science (R0)