Abstract
Over the last decade, considerable effort has been invested in research on implementing string dictionaries. A string dictionary is a data structure that bijectively maps a set of strings to a set of integers and is used in various index-based applications. A recent paper [18] can be regarded as a reference work on string dictionary implementations. Although very comprehensive, [18] does not cover the implementation of a string dictionary with an enumerated deterministic finite automaton, a data structure naturally suited for this purpose. We compare the results for the state-of-the-art compressed enumerated automaton with those presented in [18] on the same collection of data sets, and on a collection of natural language word lists. We show that our string dictionary implementation is competitive for different types of data, especially for large sets of strings and for strings that share more similarity. In particular, our method is a strong candidate for storing DNA motifs and words of inflected natural languages. We provide the code used for the experiments.
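To make the bijective mapping concrete, the following is a minimal sketch of the enumeration idea, not the compressed automaton evaluated in the paper. It assumes a plain, unminimized trie in which every node stores the number of accepted strings reachable from it; the names `Node`, `build`, `locate`, and `extract` are hypothetical and chosen only for illustration. The identifier of a string is its lexicographic rank, starting from 1.

```python
class Node:
    """Trie node annotated with the size of its right language."""
    __slots__ = ("children", "final", "count")

    def __init__(self):
        self.children = {}   # label -> Node
        self.final = False   # True if a stored string ends here
        self.count = 0       # stored strings reachable from this node


def _annotate(node):
    # Post-order pass: count = own finality + counts of all children.
    node.count = int(node.final) + sum(_annotate(c) for c in node.children.values())
    return node.count


def build(words):
    """Build the annotated trie for a collection of strings."""
    root = Node()
    for word in set(words):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.final = True
    _annotate(root)
    return root


def locate(root, word):
    """Map a string to its id in 1..n, or 0 if it is not stored."""
    node, smaller = root, 0
    for ch in word:
        if ch not in node.children:
            return 0
        if node.final:
            smaller += 1  # a proper prefix of `word` is itself stored
        for label in sorted(node.children):
            if label == ch:
                break
            smaller += node.children[label].count  # skip smaller subtrees
        node = node.children[ch]
    return smaller + 1 if node.final else 0


def extract(root, i):
    """Map an id in 1..n back to its string, or None if out of range."""
    if not 1 <= i <= root.count:
        return None
    node, out = root, []
    while True:
        if node.final:
            if i == 1:
                return "".join(out)
            i -= 1
        for label in sorted(node.children):
            child = node.children[label]
            if i <= child.count:
                out.append(label)
                node = child
                break
            i -= child.count
```

Both directions walk a single path and only add or subtract the stored counts, which is why the mapping needs no separate table of identifiers. An enumerated automaton applies the same counting to a minimized automaton, where shared suffixes are merged; that step and any compression are omitted here.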
Supported in part by Croatian Science Foundation grant No. IP-2018-01-7317 and European Regional Development Fund [KK.01.1.1.01.0009 - DATACROSS].
Notes
1. There is some ambiguity in the literature regarding the term LZ trie. As employed in [22] and in this paper, the term denotes a specific data structure (and the corresponding construction method): a trie compressed with a variant of the LZ method. In [19], LZTrie denotes a trie of phrases used in the LZ compression procedure. This inconsistency is due to the simultaneous publication process of the two papers.
References
1. Arroyuelo, D., Cánovas, R., Navarro, G., Sadakane, K.: Succinct trees in practice. In: Blelloch, G.E., Halperin, D. (eds.) ALENEX 2010, pp. 84–97. SIAM, Philadelphia (2010). https://doi.org/10.1137/1.9781611972900.9
2. Arz, J., Fischer, J.: LZ-compressed string dictionaries. In: DCC 2014, pp. 322–331. IEEE (2014). https://doi.org/10.1109/DCC.2014.36
3. Benoit, D., Demaine, E.D., Munro, J.I., Raman, R., Raman, V., Rao, S.S.: Representing trees of higher degree. Algorithmica 43(4), 275–292 (2005)
4. Brisaboa, N.R., Cánovas, R., Claude, F., Martínez-Prieto, M.A., Navarro, G.: Compressed string dictionaries. In: Pardalos, P.M., Rebennack, S. (eds.) SEA 2011. LNCS, vol. 6630, pp. 136–147. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20662-7_12
5. Daciuk, J., van Noord, G.: Finite automata for compact representation of language models in NLP. In: Watson, B.W., Wood, D. (eds.) CIAA 2001. LNCS, vol. 2494, pp. 65–73. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-36390-4_6
6. Daciuk, J., van Noord, G.: Finite automata for compact representation of tuple dictionaries. Theor. Comput. Sci. 313(1), 45–56 (2004)
7. Daciuk, J.: Experiments with automata compression. In: Yu, S., Păun, A. (eds.) CIAA 2000. LNCS, vol. 2088, pp. 105–112. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44674-5_8
8. Daciuk, J., Piskorski, J.: Gazetteer compression technique based on substructure recognition. In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds.) IIPWM 2006. AINSC, vol. 35, pp. 87–95. Springer, Heidelberg (2006). https://doi.org/10.1007/3-540-33521-8_9
9. Daciuk, J., Piskorski, J., Ristov, S.: Natural language dictionaries implemented as finite automata. In: Martín-Vide, C. (ed.) Mathematics, Computing, Language, and Life: Frontiers in Mathematical Linguistics and Language Theory, vol. 2, pp. 133–204. World Scientific & Imperial College Press, London (2010)
10. Daciuk, J., Weiss, D.: Smaller representation of finite state automata. In: Bouchou-Markhoff, B., Caron, P., Champarnaud, J.-M., Maurel, D. (eds.) CIAA 2011. LNCS, vol. 6807, pp. 118–129. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22256-6_12
11. Ferragina, P., Grossi, R., Gupta, A., Shah, R., Vitter, J.S.: On searching compressed string collections cache-obliviously. In: PODS 2008, pp. 181–190. ACM, New York (2008). https://doi.org/10.1145/1376916.1376943
12. Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Structuring labeled trees for optimal succinctness, and beyond. In: FOCS 2005, pp. 184–196. IEEE Computer Society (2005). https://doi.org/10.1109/SFCS.2005.69
13. Ferragina, P., Venturini, R.: The compressed permuterm index. ACM Trans. Algorithms 7(1), 10:1–10:21 (2010). https://doi.org/10.1145/1868237.1868248
14. Georgiev, K.: Compression of minimal acyclic deterministic FSAs preserving the linear accepting complexity. In: Mihov, S., Schulz, K.U. (eds.) Proceedings Workshop on Finite-State Techniques and Approximate Search 2007, pp. 7–13 (2007)
15. Grossi, R., Ottaviano, G.: Fast compressed tries through path decompositions. ACM J. Exp. Algorithmics 19(1), 3.4:1.1–3.4:1.20 (2014)
16. Larsson, N.J., Moffat, A.: Off-line dictionary-based compression. Proc. IEEE 88(11), 1722–1732 (2000). https://doi.org/10.1109/5.892708
17. Lucchesi, C.L., Kowaltowski, T.: Applications of finite automata representing large vocabularies. Softw. Pract. Exp. 23(1), 15–30 (1993)
18. Martínez-Prieto, M.A., Brisaboa, N., Cánovas, R., Claude, F., Navarro, G.: Practical compressed string dictionaries. Inf. Syst. 56(C), 73–108 (2016)
19. Navarro, G.: Indexing text using the Ziv-Lempel trie. J. Discret. Algorithms 2(1), 87–114 (2004). https://doi.org/10.1016/S1570-8667(03)00066-2
20. Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding \(k\)-ary trees and multisets. In: Eppstein, D. (ed.) Proceedings of SODA 2002, pp. 233–242. ACM/SIAM, Philadelphia (2002)
21. Revuz, D.: Dictionnaires et lexiques: méthodes et algorithmes. Ph.D. thesis, Institut Blaise Pascal, Paris, France (1991)
22. Ristov, S.: LZ trie and dictionary compression. Softw. Pract. Exp. 35(5), 445–465 (2005). https://doi.org/10.1002/spe.643
23. Ristov, S., Korenčić, D.: Fast construction of space-optimized recursive automaton. Softw. Pract. Exp. 45(6), 783–799 (2014). https://doi.org/10.1002/spe.2261
24. Ristov, S., Laporte, E.: Ziv Lempel compression of huge natural language data tries using suffix arrays. In: Crochemore, M., Paterson, M. (eds.) CPM 1999. LNCS, vol. 1645, pp. 196–211. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48452-3_15
25. Skibiński, P., Grabowski, S., Deorowicz, S.: Revisiting dictionary-based compression. Softw. Pract. Exp. 35(15), 1455–1476 (2005). https://doi.org/10.1002/spe.678
26. Tounsi, L., Bouchou, B., Maurel, D.: A compression method for natural language automata. In: FSMNLP 2008, pp. 146–157. IOS Press, Amsterdam (2009)
Acknowledgment
We are grateful to Miguel Martínez-Prieto for kindly providing the data sets used in [18].
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Bakarić, R., Korenčić, D., Ristov, S. (2019). Enumerated Automata Implementation of String Dictionaries. In: Hospodár, M., Jirásková, G. (eds) Implementation and Application of Automata. CIAA 2019. Lecture Notes in Computer Science, vol 11601. Springer, Cham. https://doi.org/10.1007/978-3-030-23679-3_3
DOI: https://doi.org/10.1007/978-3-030-23679-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23678-6
Online ISBN: 978-3-030-23679-3
eBook Packages: Computer Science, Computer Science (R0)