Abstract
This paper describes compact storage models for gazetteers using state-of-the-art finite-state technology. In particular, we compare the standard method, based on numbered indexing automata combined with an auxiliary storage device, against a pure finite-state representation; on real-world test data, the latter proves superior in terms of space and time complexity. We also identify the pros and cons of both approaches and report the results of empirical experiments, which serve as practical guidelines for selecting a suitable data structure for implementing a gazetteer.
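To make the first of the two compared approaches concrete, here is a minimal sketch of a numbered indexing automaton paired with an auxiliary table. It is an illustration under stated assumptions, not the paper's implementation: it uses an unminimized trie instead of a minimal automaton for brevity, and the names `State`, `build_numbered_trie`, `word_index`, `lookup`, and the sample entries are all hypothetical.

```python
from typing import Dict, List, Optional


class State:
    """One automaton state; `count` is the number of words accepted from here."""

    def __init__(self) -> None:
        self.edges: Dict[str, "State"] = {}
        self.final = False
        self.count = 0


def _number(state: State) -> int:
    # Store in each state how many entries can be completed from it.
    state.count = int(state.final) + sum(_number(t) for t in state.edges.values())
    return state.count


def build_numbered_trie(words: List[str]) -> State:
    # Assumption: a plain trie stands in for a minimized automaton.
    root = State()
    for word in words:
        state = root
        for ch in word:
            state = state.edges.setdefault(ch, State())
        state.final = True
    _number(root)
    return root


def word_index(root: State, word: str) -> Optional[int]:
    """Return the rank of `word` among the sorted entries, or None if absent."""
    index = 0
    state = root
    for ch in word:
        if ch not in state.edges:
            return None
        if state.final:
            index += 1  # a shorter entry ends here and sorts before `word`
        for label, target in state.edges.items():
            if label < ch:
                index += target.count  # skip all entries in smaller branches
        state = state.edges[ch]
    return index if state.final else None


# Hypothetical gazetteer entries and annotations (not taken from the paper).
entries = {"Berlin": "city", "Bern": "city", "Bonn": "city", "Rhine": "river"}
keys = sorted(entries)
automaton = build_numbered_trie(keys)
annotations = [entries[k] for k in keys]  # auxiliary table, indexed by rank


def lookup(name: str) -> Optional[str]:
    i = word_index(automaton, name)
    return None if i is None else annotations[i]
```

Here `lookup("Bonn")` walks the automaton to obtain the entry's rank and then reads the annotation from the side table. In the pure finite-state alternative the annotations would presumably be encoded in the automaton itself, so no side table would be needed; that variant is not shown.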
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Piskorski, J. (2006). On Compact Storage Models for Gazetteers. In: Yli-Jyrä, A., Karttunen, L., Karhumäki, J. (eds) Finite-State Methods and Natural Language Processing. FSMNLP 2005. Lecture Notes in Computer Science, vol 4002. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11780885_22
DOI: https://doi.org/10.1007/11780885_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35467-3
Online ISBN: 978-3-540-35469-7
eBook Packages: Computer Science, Computer Science (R0)