Enumerated Automata Implementation of String Dictionaries

Bakarić, Robert; Korenčić, Damir; Ristov, Strahil

doi:10.1007/978-3-030-23679-3_3

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11601)

Included in the following conference series:

International Conference on Implementation and Application of Automata

Abstract

Over the last decade, considerable effort has been invested in research on implementing string dictionaries. A string dictionary is a data structure that bijectively maps a set of strings to a set of integers and is used in various index-based applications. A recent paper [18] can be regarded as a reference work on string dictionary implementations. Although very comprehensive, [18] does not cover the implementation of a string dictionary with an enumerated deterministic finite automaton, a data structure naturally suited for this purpose. We compare the results for the state-of-the-art compressed enumerated automaton with those presented in [18] on the same collection of data sets, and on a collection of natural language word lists. We show that our string dictionary implementation is a competitive variant for different types of data, especially for large sets of strings and for strings that share more similarity. In particular, our method is a prominent solution for storing DNA motifs and words of inflected natural languages. We provide the code used for the experiments.
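To illustrate the bijective string-to-integer mapping described in the abstract, the following is a minimal sketch of an enumerated automaton, not the compressed implementation evaluated in the paper. It assumes an unminimized, trie-shaped acyclic automaton in which every state stores the number of strings accepted from it; the names State, EnumeratedTrie, locate and extract are hypothetical and chosen only for this example.

```python
from typing import Dict, Iterable, List, Optional


class State:
    """One automaton state; `count` is the number of strings accepted from here."""

    def __init__(self) -> None:
        self.edges: Dict[str, "State"] = {}
        self.final = False
        self.count = 0


class EnumeratedTrie:
    """Trie-shaped acyclic DFA whose states carry counts (hypothetical class)."""

    def __init__(self, words: Iterable[str]) -> None:
        self.root = State()
        for w in set(words):
            node = self.root
            for ch in w:
                node = node.edges.setdefault(ch, State())
            node.final = True
        self._fill_counts(self.root)

    def _fill_counts(self, node: State) -> int:
        # Strings accepted from a state: itself (if final) plus all children.
        node.count = int(node.final) + sum(
            self._fill_counts(child) for child in node.edges.values()
        )
        return node.count

    def locate(self, word: str) -> Optional[int]:
        """String -> ID in 1..n (lexicographic rank), or None if absent."""
        node, rank = self.root, 0
        for ch in word:
            if ch not in node.edges:
                return None
            if node.final:
                rank += 1  # the stored prefix ending here sorts before `word`
            # Skip every string that branches off through a smaller symbol.
            rank += sum(c.count for s, c in node.edges.items() if s < ch)
            node = node.edges[ch]
        return rank + 1 if node.final else None

    def extract(self, ident: int) -> Optional[str]:
        """ID -> string, the inverse of locate."""
        if not 1 <= ident <= self.root.count:
            return None
        node, out, remaining = self.root, [], ident
        while True:
            if node.final:
                if remaining == 1:
                    return "".join(out)
                remaining -= 1
            for sym in sorted(node.edges):
                child = node.edges[sym]
                if remaining <= child.count:
                    out.append(sym)
                    node = child
                    break
                remaining -= child.count


words: List[str] = ["car", "card", "care", "cat", "do"]
d = EnumeratedTrie(words)
print(d.locate("care"), d.extract(4))
```

In this sketch the integer assigned to a string is its lexicographic rank, so locate and extract are mutual inverses over the stored set. A practical implementation would additionally merge equivalent states and compress the resulting automaton, which this example does not attempt.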

Supported in part by Croatian Science Foundation grant No. IP-2018-01-7317 and European Regional Development Fund [KK.01.1.1.01.0009 - DATACROSS].

Notes

1. There is a certain ambiguity in the literature regarding the use of the term LZ trie. As employed in [22] and in this paper, the term denotes a specific data structure (and the corresponding method of construction): a trie compressed with a variant of the LZ method. In [19], by contrast, LZTrie denotes a trie of phrases used in the LZ compression procedure. This inconsistency is due to the simultaneous publication process of the two papers.

References

1. Arroyuelo, D., Cánovas, R., Navarro, G., Sadakane, K.: Succinct trees in practice. In: Blelloch, G.E., Halperin, D. (eds.) ALENEX 2010, pp. 84–97. SIAM, Philadelphia (2010). https://doi.org/10.1137/1.9781611972900.9
2. Arz, J., Fischer, J.: LZ-compressed string dictionaries. In: DCC 2014, pp. 322–331. IEEE (2014). https://doi.org/10.1109/DCC.2014.36
3. Benoit, D., Demaine, E.D., Munro, J.I., Raman, R., Raman, V., Rao, S.S.: Representing trees of higher degree. Algorithmica 43(4), 275–292 (2005)
4. Brisaboa, N.R., Cánovas, R., Claude, F., Martínez-Prieto, M.A., Navarro, G.: Compressed string dictionaries. In: Pardalos, P.M., Rebennack, S. (eds.) SEA 2011. LNCS, vol. 6630, pp. 136–147. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20662-7_12
5. Daciuk, J., van Noord, G.: Finite automata for compact representation of language models in NLP. In: Watson, B.W., Wood, D. (eds.) CIAA 2001. LNCS, vol. 2494, pp. 65–73. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-36390-4_6
6. Daciuk, J., van Noord, G.: Finite automata for compact representation of tuple dictionaries. Theor. Comput. Sci. 313(1), 45–56 (2004)
7. Daciuk, J.: Experiments with automata compression. In: Yu, S., Păun, A. (eds.) CIAA 2000. LNCS, vol. 2088, pp. 105–112. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44674-5_8
8. Daciuk, J., Piskorski, J.: Gazetteer compression technique based on substructure recognition. In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds.) IIPWM 2006. AINSC, vol. 35, pp. 87–95. Springer, Heidelberg (2006). https://doi.org/10.1007/3-540-33521-8_9
9. Daciuk, J., Piskorski, J., Ristov, S.: Natural language dictionaries implemented as finite automata. In: Martín-Vide, C. (ed.) Mathematics, Computing, Language, and Life: Frontiers in Mathematical Linguistics and Language Theory, vol. 2, pp. 133–204. World Scientific & Imperial College Press, London (2010)
10. Daciuk, J., Weiss, D.: Smaller representation of finite state automata. In: Bouchou-Markhoff, B., Caron, P., Champarnaud, J.-M., Maurel, D. (eds.) CIAA 2011. LNCS, vol. 6807, pp. 118–129. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22256-6_12
11. Ferragina, P., Grossi, R., Gupta, A., Shah, R., Vitter, J.S.: On searching compressed string collections cache-obliviously. In: PODS 2008, pp. 181–190. ACM, New York (2008). https://doi.org/10.1145/1376916.1376943
12. Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Structuring labeled trees for optimal succinctness, and beyond. In: FOCS 2005, pp. 184–196. IEEE Computer Society (2005). https://doi.org/10.1109/SFCS.2005.69
13. Ferragina, P., Venturini, R.: The compressed permuterm index. ACM Trans. Algorithms 7(1), 10:1–10:21 (2010). https://doi.org/10.1145/1868237.1868248
14. Georgiev, K.: Compression of minimal acyclic deterministic FSAs preserving the linear accepting complexity. In: Mihov, S., Schulz, K.U. (eds.) Proceedings Workshop on Finite-State Techniques and Approximate Search 2007, pp. 7–13 (2007)
15. Grossi, R., Ottaviano, G.: Fast compressed tries through path decompositions. ACM J. Exp. Algorithmics 19(1), 3.4:1.1–3.4:1.20 (2014)
16. Larsson, N.J., Moffat, A.: Off-line dictionary-based compression. Proc. IEEE 88(11), 1722–1732 (2000). https://doi.org/10.1109/5.892708
17. Lucchesi, C.L., Kowaltowski, T.: Applications of finite automata representing large vocabularies. Softw. Pract. Exp. 23(1), 15–30 (1993)
18. Martínez-Prieto, M.A., Brisaboa, N., Cánovas, R., Claude, F., Navarro, G.: Practical compressed string dictionaries. Inf. Syst. 56(C), 73–108 (2016)
19. Navarro, G.: Indexing text using the Ziv-Lempel trie. J. Discret. Algorithms 2(1), 87–114 (2004). https://doi.org/10.1016/S1570-8667(03)00066-2
20. Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Eppstein, D. (ed.) Proceedings of SODA 2002, pp. 233–242. ACM/SIAM, Philadelphia (2002)
21. Revuz, D.: Dictionnaires et lexiques: méthodes et algorithmes. Ph.D. thesis, Institut Blaise Pascal, Paris, France (1991)
22. Ristov, S.: LZ trie and dictionary compression. Softw. Pract. Exp. 35(5), 445–465 (2005). https://doi.org/10.1002/spe.643
23. Ristov, S., Korenčić, D.: Fast construction of space-optimized recursive automaton. Softw. Pract. Exp. 45(6), 783–799 (2014). https://doi.org/10.1002/spe.2261
24. Ristov, S., Laporte, E.: Ziv Lempel compression of huge natural language data tries using suffix arrays. In: Crochemore, M., Paterson, M. (eds.) CPM 1999. LNCS, vol. 1645, pp. 196–211. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48452-3_15
25. Skibiński, P., Grabowski, S., Deorowicz, S.: Revisiting dictionary-based compression. Softw. Pract. Exp. 35(15), 1455–1476 (2005). https://doi.org/10.1002/spe.678
26. Tounsi, L., Bouchou, B., Maurel, D.: A compression method for natural language automata. In: FSMNLP 2008, pp. 146–157. IOS Press, Amsterdam (2009)

Acknowledgment

We are grateful to Miguel Martínez-Prieto for kindly providing data sets used in [18].

Author information

Authors and Affiliations

Department of Electronics, Ruđer Bošković Institute, Bijenićka 54, 10000, Zagreb, Croatia
Robert Bakarić, Damir Korenčić & Strahil Ristov

Corresponding author

Correspondence to Strahil Ristov.

Editor information

Editors and Affiliations

Slovak Academy of Sciences, Košice, Slovakia
Michal Hospodár
Slovak Academy of Sciences, Košice, Slovakia
Galina Jirásková

About this paper

Cite this paper

Bakarić, R., Korenčić, D., Ristov, S. (2019). Enumerated Automata Implementation of String Dictionaries. In: Hospodár, M., Jirásková, G. (eds) Implementation and Application of Automata. CIAA 2019. Lecture Notes in Computer Science, vol 11601. Springer, Cham. https://doi.org/10.1007/978-3-030-23679-3_3

DOI: https://doi.org/10.1007/978-3-030-23679-3_3
Published: 26 June 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23678-6
Online ISBN: 978-3-030-23679-3
eBook Packages: Computer Science, Computer Science (R0)
