
Enumerated Automata Implementation of String Dictionaries

  • Conference paper
Implementation and Application of Automata (CIAA 2019)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 11601)


Abstract

Over the last decade, considerable effort has been invested in research on implementing string dictionaries. A string dictionary is a data structure that bijectively maps a set of strings to a set of integers and is used in various index-based applications. A recent paper [18] can be regarded as a reference work on string dictionary implementations. Although very comprehensive, [18] does not cover the implementation of a string dictionary with an enumerated deterministic finite automaton, a data structure naturally suited for this purpose. We compare the results for a state-of-the-art compressed enumerated automaton with those presented in [18] on the same collection of data sets and on a collection of natural language word lists. We show that our string dictionary implementation is competitive for different types of data, especially for large sets of strings and for strings that are highly similar to one another. In particular, our method stands out as a solution for storing DNA motifs and words of inflected natural languages. We provide the code used for the experiments.
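To illustrate the string-to-integer bijection described in the abstract, the following is a minimal sketch of an enumerated acyclic automaton. It is not the authors' compressed implementation: for brevity it uses an unminimized trie as the deterministic automaton, and the names `State`, `build`, `locate`, and `extract` are hypothetical, chosen only for this example. Each state stores the number of accepted words reachable from it, which is enough to map a word to its lexicographic rank and back.

```python
from typing import Dict, Iterable, Optional


class State:
    """One automaton state: transitions, finality flag, and a word count."""

    def __init__(self) -> None:
        self.next: Dict[str, "State"] = {}
        self.final = False
        self.count = 0  # accepted words reachable from this state (incl. itself)


def _fill_counts(state: State) -> int:
    # Post-order pass: a state's count is its own finality plus its children's counts.
    state.count = int(state.final) + sum(_fill_counts(s) for s in state.next.values())
    return state.count


def build(words: Iterable[str]) -> State:
    """Build a trie (an acyclic DFA, left unminimized here) and enumerate it."""
    root = State()
    for word in set(words):
        state = root
        for ch in word:
            state = state.next.setdefault(ch, State())
        state.final = True
    _fill_counts(root)
    return root


def locate(root: State, word: str) -> Optional[int]:
    """Map a word to its ID in 1..n (lexicographic rank), or None if absent."""
    state, rank = root, 0
    for ch in word:
        if ch not in state.next:
            return None
        if state.final:
            rank += 1  # the prefix read so far is a smaller word
        for label in sorted(state.next):
            if label >= ch:
                break
            rank += state.next[label].count  # skip all words under smaller labels
        state = state.next[ch]
    return rank + 1 if state.final else None


def extract(root: State, ident: int) -> Optional[str]:
    """Map an ID in 1..n back to its word, or None if the ID is out of range."""
    if not 1 <= ident <= root.count:
        return None
    state, remaining, out = root, ident, []
    while True:
        if state.final:
            remaining -= 1
            if remaining == 0:
                return "".join(out)
        for label in sorted(state.next):
            child = state.next[label]
            if remaining <= child.count:
                out.append(label)
                state = child
                break
            remaining -= child.count  # the target word is not under this label
```

Both directions walk a single path of the automaton, so their cost depends on the word length and the alphabet size rather than on the number of stored strings. A practical implementation would presumably store the counts in a minimized and compressed automaton instead of a pointer-based trie.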

Supported in part by Croatian Science Foundation grant No. IP-2018-01-7317 and European Regional Development Fund [KK.01.1.1.01.0009 - DATACROSS].

Notes

  1. There is some ambiguity in the literature regarding the use of the term LZ trie. As employed in [22] and in this paper, the term denotes a specific data structure (and the corresponding method of construction), namely a trie compressed with a variant of the LZ method, whereas in [19] LZTrie denotes a trie of phrases used in the LZ compression procedure. This inconsistency is due to the simultaneous publication process of the two papers.

References

  1. Arroyuelo, D., Cánovas, R., Navarro, G., Sadakane, K.: Succinct trees in practice. In: Blelloch, G.E., Halperin, D. (eds.) ALENEX 2010, pp. 84–97. SIAM, Philadelphia (2010). https://doi.org/10.1137/1.9781611972900.9

  2. Arz, J., Fischer, J.: LZ-compressed string dictionaries. In: DCC 2014, pp. 322–331. IEEE (2014). https://doi.org/10.1109/DCC.2014.36

  3. Benoit, D., Demaine, E.D., Munro, J.I., Raman, R., Raman, V., Rao, S.S.: Representing trees of higher degree. Algorithmica 43(4), 275–292 (2005)

  4. Brisaboa, N.R., Cánovas, R., Claude, F., Martínez-Prieto, M.A., Navarro, G.: Compressed string dictionaries. In: Pardalos, P.M., Rebennack, S. (eds.) SEA 2011. LNCS, vol. 6630, pp. 136–147. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20662-7_12

  5. Daciuk, J., van Noord, G.: Finite automata for compact representation of language models in NLP. In: Watson, B.W., Wood, D. (eds.) CIAA 2001. LNCS, vol. 2494, pp. 65–73. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-36390-4_6

  6. Daciuk, J., van Noord, G.: Finite automata for compact representation of tuple dictionaries. Theor. Comput. Sci. 313(1), 45–56 (2004)

  7. Daciuk, J.: Experiments with automata compression. In: Yu, S., Păun, A. (eds.) CIAA 2000. LNCS, vol. 2088, pp. 105–112. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-44674-5_8

  8. Daciuk, J., Piskorski, J.: Gazetteer compression technique based on substructure recognition. In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds.) IIPWM 2006. AINSC, vol. 35, pp. 87–95. Springer, Heidelberg (2006). https://doi.org/10.1007/3-540-33521-8_9

  9. Daciuk, J., Piskorski, J., Ristov, S.: Natural language dictionaries implemented as finite automata. In: Martín-Vide, C. (ed.) Mathematics, Computing, Language, and Life: Frontiers in Mathematical Linguistics and Language Theory, vol. 2, pp. 133–204. World Scientific & Imperial College Press, London (2010)

  10. Daciuk, J., Weiss, D.: Smaller representation of finite state automata. In: Bouchou-Markhoff, B., Caron, P., Champarnaud, J.-M., Maurel, D. (eds.) CIAA 2011. LNCS, vol. 6807, pp. 118–129. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-22256-6_12

  11. Ferragina, P., Grossi, R., Gupta, A., Shah, R., Vitter, J.S.: On searching compressed string collections cache-obliviously. In: PODS 2008, pp. 181–190. ACM, New York (2008). https://doi.org/10.1145/1376916.1376943

  12. Ferragina, P., Luccio, F., Manzini, G., Muthukrishnan, S.: Structuring labeled trees for optimal succinctness, and beyond. In: FOCS 2005, pp. 184–196. IEEE Computer Society (2005). https://doi.org/10.1109/SFCS.2005.69

  13. Ferragina, P., Venturini, R.: The compressed permuterm index. ACM Trans. Algorithms 7(1), 10:1–10:21 (2010). https://doi.org/10.1145/1868237.1868248

  14. Georgiev, K.: Compression of minimal acyclic deterministic FSAs preserving the linear accepting complexity. In: Mihov, S., Schulz, K.U. (eds.) Proceedings Workshop on Finite-State Techniques and Approximate Search 2007, pp. 7–13 (2007)

  15. Grossi, R., Ottaviano, G.: Fast compressed tries through path decompositions. ACM J. Exp. Algorithmics 19(1), 3.4:1.1–3.4:1.20 (2014)

  16. Larsson, N.J., Moffat, A.: Off-line dictionary-based compression. Proc. IEEE 88(11), 1722–1732 (2000). https://doi.org/10.1109/5.892708

  17. Lucchesi, C.L., Kowaltowski, T.: Applications of finite automata representing large vocabularies. Softw. Pract. Exp. 23(1), 15–30 (1993)

  18. Martínez-Prieto, M.A., Brisaboa, N., Cánovas, R., Claude, F., Navarro, G.: Practical compressed string dictionaries. Inf. Syst. 56(C), 73–108 (2016)

  19. Navarro, G.: Indexing text using the Ziv-Lempel trie. J. Discret. Algorithms 2(1), 87–114 (2004). https://doi.org/10.1016/S1570-8667(03)00066-2

  20. Raman, R., Raman, V., Rao, S.S.: Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In: Eppstein, D. (ed.) Proceedings of SODA 2002, pp. 233–242. ACM/SIAM, Philadelphia (2002)

  21. Revuz, D.: Dictionnaires et lexiques: méthodes et algorithmes. Ph.D. thesis, Institut Blaise Pascal, Paris, France (1991)

  22. Ristov, S.: LZ trie and dictionary compression. Softw. Pract. Exp. 35(5), 445–465 (2005). https://doi.org/10.1002/spe.643

  23. Ristov, S., Korenčić, D.: Fast construction of space-optimized recursive automaton. Softw. Pract. Exp. 45(6), 783–799 (2014). https://doi.org/10.1002/spe.2261

  24. Ristov, S., Laporte, E.: Ziv Lempel compression of huge natural language data tries using suffix arrays. In: Crochemore, M., Paterson, M. (eds.) CPM 1999. LNCS, vol. 1645, pp. 196–211. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48452-3_15

  25. Skibiński, P., Grabowski, S., Deorowicz, S.: Revisiting dictionary-based compression. Softw. Pract. Exp. 35(15), 1455–1476 (2005). https://doi.org/10.1002/spe.678

  26. Tounsi, L., Bouchou, B., Maurel, D.: A compression method for natural language automata. In: FSMNLP 2008, pp. 146–157. IOS Press, Amsterdam (2009)

Acknowledgment

We are grateful to Miguel Martínez-Prieto for kindly providing data sets used in [18].

Author information

Corresponding author

Correspondence to Strahil Ristov.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Bakarić, R., Korenčić, D., Ristov, S. (2019). Enumerated Automata Implementation of String Dictionaries. In: Hospodár, M., Jirásková, G. (eds) Implementation and Application of Automata. CIAA 2019. Lecture Notes in Computer Science, vol 11601. Springer, Cham. https://doi.org/10.1007/978-3-030-23679-3_3

  • DOI: https://doi.org/10.1007/978-3-030-23679-3_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-23678-6

  • Online ISBN: 978-3-030-23679-3

  • eBook Packages: Computer Science, Computer Science (R0)
