Abstract
In this paper, we describe a generalization of k-gram models to stochastic tree languages. These models are based on the k-testable class, a subclass of the languages recognizable by ascending tree automata. One advantage of this approach is that the probabilistic model can be updated incrementally. Another is that backing-off schemes can be defined. As an illustration of their applicability, the models have been used to compress tree data files at a better rate than string-based methods.
Work supported by the Spanish Comisión Interministerial de Ciencia y Tecnología through grant TIC2000-1599-C02.
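To make the idea of an incrementally updatable tree model concrete, the following is a minimal sketch for the simplest case, k = 2, where the expansion of a node is assumed to depend only on its own label. It is not the paper's construction: the tree representation (a `(label, children)` pair), the class name `TwoTestableModel`, and the plain relative-frequency estimates are all assumptions made for illustration, and no backing-off scheme is included.

```python
from collections import Counter

# Assumed representation: a tree is a (label, children) pair,
# and a leaf has an empty tuple of children.


class TwoTestableModel:
    """Hypothetical k = 2 sketch: a node's expansion depends only on its label."""

    def __init__(self):
        self.n_trees = 0
        self.root_counts = Counter()   # label observed at the root of a tree
        self.state_counts = Counter()  # label observed at any node
        self.fork_counts = Counter()   # (label, child labels) observed at any node

    def update(self, tree):
        """Add one tree to the counts; earlier trees are not revisited."""
        self.n_trees += 1
        self.root_counts[tree[0]] += 1
        stack = [tree]
        while stack:
            label, children = stack.pop()
            child_labels = tuple(child[0] for child in children)
            self.state_counts[label] += 1
            self.fork_counts[(label, child_labels)] += 1
            stack.extend(children)

    def probability(self, tree):
        """Relative-frequency probability of a tree; 0.0 if any event is unseen."""
        if self.n_trees == 0:
            return 0.0
        p = self.root_counts[tree[0]] / self.n_trees
        stack = [tree]
        while stack and p > 0.0:
            label, children = stack.pop()
            child_labels = tuple(child[0] for child in children)
            seen = self.state_counts[label]
            p *= self.fork_counts[(label, child_labels)] / seen if seen else 0.0
            stack.extend(children)
        return p


def leaf(label):
    """Build a leaf node in the assumed representation."""
    return (label, ())


# Usage: the model is updated one tree at a time.
model = TwoTestableModel()
model.update(("a", (leaf("b"), leaf("b"))))
model.update(("a", (leaf("b"),)))
p_seen = model.probability(("a", (leaf("b"),)))
p_unseen = model.probability(("a", (leaf("c"),)))
```

Because the model is stored as counts, adding a tree only increments a few counters, which is one way to read the incremental-update property mentioned in the abstract. Unseen events receive probability zero in this sketch; a backing-off scheme would be needed to smooth them.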
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Rico-Juan, J.R., Calera-Rubio, J., Carrasco, R.C. (2002). Stochastic k-testable Tree Languages and Applications. In: Adriaans, P., Fernau, H., van Zaanen, M. (eds) Grammatical Inference: Algorithms and Applications. ICGI 2002. Lecture Notes in Computer Science, vol 2484. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45790-9_16
DOI: https://doi.org/10.1007/3-540-45790-9_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44239-4
Online ISBN: 978-3-540-45790-9
eBook Packages: Springer Book Archive