Abstract
The smallest grammar problem, namely finding a smallest context-free grammar that generates exactly one sequence, is of practical and theoretical importance in fields such as Kolmogorov complexity, data compression, and pattern discovery. We propose to focus on the choice of the occurrences to be rewritten by non-terminals. We extend classical offline algorithms by introducing a global optimization of this choice at each step of the algorithm. This approach allows us to define the search space of a smallest grammar by separating the choice of the non-terminals from the choice of their occurrences. We propose a second algorithm that performs a broader exploration by allowing the removal of useless words that were chosen previously. Experiments on a classical benchmark show that our algorithms consistently find smaller grammars than state-of-the-art algorithms.
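To make the idea of separating the choice of words from the choice of their occurrences more concrete, here is a minimal sketch in Python. It is not the authors' algorithm. It assumes the set of words (the strings that would become non-terminals) is already fixed, and uses a simple dynamic program to pick which occurrences to rewrite so that the sequence is written with as few symbols as possible. The function names `min_parse` and `grammar_size` are hypothetical, and measuring grammar size as the total length of all right-hand sides is an assumption made for illustration.

```python
def min_parse(seq, words):
    """Return a shortest list of tokens whose concatenation is `seq`.

    Each token is either a single character (a terminal) or a whole
    string from `words` (standing for one non-terminal occurrence).
    Words are assumed to be non-empty strings.
    """
    n = len(seq)
    # best[i] = minimal number of tokens needed to write seq[:i]
    best = [0] + [n + 1] * n
    # back[i] = last token of an optimal parse of seq[:i]
    back = [None] * (n + 1)
    for i in range(1, n + 1):
        # Option 1: emit the single terminal symbol seq[i - 1].
        best[i] = best[i - 1] + 1
        back[i] = seq[i - 1]
        # Option 2: emit one non-terminal for a word ending at position i.
        for w in words:
            j = i - len(w)
            if j >= 0 and seq[j:i] == w and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
                back[i] = w
    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(back[i])
        i -= len(back[i])
    return tokens[::-1]


def grammar_size(seq, words):
    """Total right-hand-side length (illustrative size measure).

    The start rule is the parse of `seq`; each word gets one rule whose
    body is parsed with the strictly shorter words only.
    """
    total = len(min_parse(seq, words))
    for w in words:
        shorter = [v for v in words if len(v) < len(w)]
        total += len(min_parse(w, shorter))
    return total


if __name__ == "__main__":
    print(min_parse("abcabcab", ["ab", "abc"]))
    print(grammar_size("abcabcab", ["ab", "abc"]))
```

Under these assumptions, a search over candidate word sets can call `grammar_size` as its scoring function, which keeps the question of which words to use apart from the question of where to use them.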
The work described in this paper is partially supported by the Program of International Scientific Cooperation MINCyT - INRIA/CNRS.
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Carrascosa, R., Coste, F., Gallé, M., Infante-Lopez, G. (2010). Choosing Word Occurrences for the Smallest Grammar Problem. In: Dediu, AH., Fernau, H., Martín-Vide, C. (eds) Language and Automata Theory and Applications. LATA 2010. Lecture Notes in Computer Science, vol 6031. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13089-2_13
DOI: https://doi.org/10.1007/978-3-642-13089-2_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13088-5
Online ISBN: 978-3-642-13089-2
eBook Packages: Computer Science, Computer Science (R0)