Abstract
Finite-state automata are a state-of-the-art representation of dictionaries in natural language processing. We present a novel compression technique that is especially useful for gazetteers, a particular kind of dictionary. We replace common substructures in the automaton by unique copies. To find them, we treat the transition vector as a string and apply a Ziv-Lempel-style text compression technique that uses a suffix tree to find repetitions in linear time. Empirical evaluation on real-world data reveals space savings of up to 18.6%, which makes this method highly attractive.
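The idea of treating a transition vector as a string and replacing repeated runs with references can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `(label, target)` encoding of transitions, the `Pointer` record, and the `min_len` threshold are assumptions made for illustration, and a naive quadratic search stands in for the suffix tree that the abstract says achieves linear time.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Hypothetical encoding: a transition is (label, target state index).
Transition = Tuple[str, int]


@dataclass(frozen=True)
class Pointer:
    """Reference to a run that already occurred earlier in the vector."""
    start: int   # index of the run in the uncompressed vector
    length: int  # number of transitions in the run


Item = Union[Transition, Pointer]


def compress(vector: List[Transition], min_len: int = 2) -> List[Item]:
    """Greedy Ziv-Lempel-style pass over the transition vector.

    Uses a naive O(n^2) search for the longest earlier match; a suffix
    tree would be needed to find repetitions in linear time.
    """
    out: List[Item] = []
    i, n = 0, len(vector)
    while i < n:
        best_len, best_pos = 0, -1
        for j in range(i):
            k = 0
            # The matched source run must lie entirely before position i.
            while j + k < i and i + k < n and vector[j + k] == vector[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len >= min_len:
            out.append(Pointer(best_pos, best_len))
            i += best_len
        else:
            out.append(vector[i])
            i += 1
    return out


def decompress(items: List[Item]) -> List[Transition]:
    """Expand pointers back into the original transition vector."""
    vector: List[Transition] = []
    for item in items:
        if isinstance(item, Pointer):
            vector.extend(vector[item.start:item.start + item.length])
        else:
            vector.append(item)
    return vector


if __name__ == "__main__":
    vec = [("a", 1), ("b", 2), ("c", 3), ("a", 1), ("b", 2), ("c", 3), ("d", 4)]
    packed = compress(vec)
    print(len(vec), "->", len(packed))
    assert decompress(packed) == vec
```

In this toy example the second occurrence of the three-transition run is stored as a single pointer. A real automaton representation would also have to account for the storage cost of the pointer itself, which this sketch ignores.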
© 2006 Springer
About this paper
Cite this paper
Daciuk, J., Piskorski, J. (2006). Gazetteer Compression Technique Based on Substructure Recognition. In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds) Intelligent Information Processing and Web Mining. Advances in Soft Computing, vol 35. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-33521-8_9
DOI: https://doi.org/10.1007/3-540-33521-8_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33520-7
Online ISBN: 978-3-540-33521-4
eBook Packages: Engineering, Engineering (R0)