Abstract
This paper introduces a method to compress, store, and search the dictionary of a natural language. The dictionary can be represented as groups of words, each derived from a common stem. We describe how to represent and store a word group so that it is both compact and efficiently searchable. The compression efficiency of the algorithm depends strongly on the quality of the information source. Currently available tools and data sources contain several errors, which the introduced method can correct. The paper also analyzes the efficiency of XML and two binary formats, and proposes two methods, directed acyclic graph transformation and word group regrouping, that can be used to increase efficiency.
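As a rough illustration of the idea of stem-based word groups, the following Python sketch stores each group as a stem plus a set of suffixes and lets groups with identical suffix sets share one stored copy. This sharing is only a simple stand-in for the directed acyclic graph transformation named above; the paper's actual storage format is not shown here, and the names `build_groups`, `contains`, and the sample words are assumptions made for this example.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


def build_groups(
    groups: Dict[str, Iterable[str]]
) -> Tuple[Dict[str, int], List[FrozenSet[str]]]:
    """Map each stem to the index of a suffix set; equal sets are stored once."""
    suffix_sets: List[FrozenSet[str]] = []     # unique suffix sets
    index_of: Dict[FrozenSet[str], int] = {}   # suffix set -> index in suffix_sets
    stem_to_set: Dict[str, int] = {}
    for stem, words in groups.items():
        # Keep only the part of each word that follows the stem.
        suffixes = frozenset(w[len(stem):] for w in words if w.startswith(stem))
        if suffixes not in index_of:
            index_of[suffixes] = len(suffix_sets)
            suffix_sets.append(suffixes)
        stem_to_set[stem] = index_of[suffixes]
    return stem_to_set, suffix_sets


def contains(
    word: str,
    stem_to_set: Dict[str, int],
    suffix_sets: List[FrozenSet[str]],
) -> bool:
    """Return True if the word belongs to some stored word group."""
    # Try every prefix of the word as a candidate stem.
    for cut in range(len(word) + 1):
        set_id = stem_to_set.get(word[:cut])
        if set_id is not None and word[cut:] in suffix_sets[set_id]:
            return True
    return False


# Hypothetical sample data: two groups share the same suffix set.
groups = {
    "walk": ["walk", "walks", "walked", "walking"],
    "talk": ["talk", "talks", "talked", "talking"],
    "cat": ["cat", "cats"],
}
stem_to_set, suffix_sets = build_groups(groups)
```

In this sketch the groups for "walk" and "talk" point to the same suffix set, so only two suffix sets are stored for three stems. A lookup costs one dictionary probe per prefix of the query word.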
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Németh, D., Lakat, M., Szeberényi, I. (2010). Dictionary Compression and Information Source Correction. In: Lirkov, I., Margenov, S., Waśniewski, J. (eds) Large-Scale Scientific Computing. LSSC 2009. Lecture Notes in Computer Science, vol 5910. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-12535-5_61
DOI: https://doi.org/10.1007/978-3-642-12535-5_61
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-12534-8
Online ISBN: 978-3-642-12535-5
eBook Packages: Computer Science, Computer Science (R0)