Compressed double-array tries for string dictionaries supporting fast lookup

  • Regular Paper
Knowledge and Information Systems

Abstract

A string dictionary is a basic tool for storing a set of strings, and it is used in many kinds of applications. Many recent applications need space-efficient dictionaries to handle very large datasets. In this paper, we propose new compressed string dictionaries based on improved double-array tries. The double-array trie is a data structure that implements a string dictionary with extremely fast lookup, but its space efficiency is low. We introduce approaches that address this disadvantage. Experimental evaluations show that our dictionaries provide the fastest lookup among the state-of-the-art compressed string dictionaries compared, and that their space efficiency is competitive in many cases.
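
To make the lookup mechanism of a double-array trie concrete, the following is a minimal Python sketch of the traditional, uncompressed structure in the style of [1]. It is not the compressed dictionary proposed in this paper. The names `build`, `lookup`, `codes`, `TERM`, and `FREE`, the byte-plus-one character codes, and the naive first-fit placement are all assumptions made for illustration.

```python
FREE = -1   # marks an unused slot in CHECK (assumption of this sketch)
TERM = 0    # code of the end-of-key terminator (assumption of this sketch)

def codes(key):
    """Byte codes shifted by one so that 0 is reserved for TERM."""
    return [b + 1 for b in key.encode("utf-8")] + [TERM]

def build(keys):
    """Build BASE and CHECK with a naive first-fit search (slow but simple)."""
    trie = [{}]                                  # trie[node] = {code: child}
    for key in keys:
        node = 0
        for c in codes(key):
            if c not in trie[node]:
                trie.append({})
                trie[node][c] = len(trie) - 1
            node = trie[node][c]

    base, check = [1], [0]                       # slot 0 holds the root
    queue = [(0, 0)]                             # (trie node, array slot)
    while queue:
        node, s = queue.pop(0)
        edges = trie[node]
        if not edges:
            continue                             # leaf: no BASE value needed
        b = 1                                    # b >= 1 keeps slot 0 for the root
        while any(b + c < len(check) and check[b + c] != FREE for c in edges):
            b += 1
        grow = b + max(edges) + 1 - len(check)
        if grow > 0:
            base.extend([0] * grow)
            check.extend([FREE] * grow)
        base[s] = b
        for c, child in edges.items():
            check[b + c] = s                     # child slot points back to its parent
            queue.append((child, b + c))
    return base, check

def lookup(base, check, key):
    """Follow t = BASE[s] + c and accept the move only if CHECK[t] == s."""
    s = 0
    for c in codes(key):
        t = base[s] + c
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return True

base, check = build(["trie", "tree", "try", "array"])
print(lookup(base, check, "tree"), lookup(base, check, "tr"))  # prints: True False
```

Each step of `lookup` costs one addition and one comparison, which is where the fast lookup comes from. Counting the `FREE` entries in `check` gives a rough view of the slots this naive layout leaves unused.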

Notes

  1. Yet another Part-of-Speech and Morphological Analyzer at http://taku910.github.io/mecab/.

  2. An open-source full-text search engine and column store at http://groonga.org/.

  3. The operator \(\oplus \) denotes XOR (exclusive OR). Traditional implementations use addition (\(+\)), but recent ones, such as [42] and Darts-clone at https://github.com/s-yata/darts-clone, often substitute XOR (\(\oplus \)).

  4. https://dumps.wikimedia.org.

  5. http://download.geonames.org/export/dump/allCountries.zip.

  6. http://dist.s-yata.jp/corpus/nwc2010/ngrams/word/over999/filelist.

  7. http://data.law.di.unimi.it/webdata/uk-2005/uk-2005.urls.gz.

  8. http://pizzachili.dcc.uchile.cl/texts/dna/dna.gz.

  9. https://github.com/ot/path_decomposed_tries.

  10. https://github.com/migumar2/libCSD.
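
As a small illustration of Note 3, the sketch below contrasts the two ways of computing a child slot from a node's base value: addition, as in traditional implementations, and XOR, as in recent ones. The function names and the example base value `0x1234` are hypothetical. The observation that XOR with an 8-bit code changes only the low 8 bits is a general property of XOR, not a claim taken from the paper.

```python
def child_plus(base_s, c):
    """Traditional transition: child slot = BASE[s] + c."""
    return base_s + c

def child_xor(base_s, c):
    """XOR transition: child slot = BASE[s] XOR c."""
    return base_s ^ c

base_s = 0x1234                      # hypothetical base value of some node s
plus_slots = [child_plus(base_s, c) for c in range(256)]
xor_slots = [child_xor(base_s, c) for c in range(256)]

# Both mappings give each 8-bit code its own slot.
assert len(set(plus_slots)) == len(set(xor_slots)) == 256

# XOR only flips the low 8 bits, so all children stay in one 256-slot block,
# whereas addition can spill into the next block.
print(sorted({t >> 8 for t in xor_slots}))    # prints: [18]
print(sorted({t >> 8 for t in plus_slots}))   # prints: [18, 19]
```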

References

  1. Aoe J (1989) An efficient digital search algorithm by using a double-array structure. IEEE Trans Softw Eng 15(9):1066–1077

  2. Aoe J, Morimoto K (1992) An efficient implementation of trie structures. Softw Pract Exp 22(9):695–721

  3. Arroyuelo D, Cánovas R, Navarro G, Sadakane K (2010) Succinct trees in practice. In: Proceedings of the 11th meeting on algorithm engineering and experimentation (ALENEX), pp. 84–97

  4. Arz J, Fischer J (2014) LZ-compressed string dictionaries. In: Proceedings of the data compression conference (DCC), pp. 322–331

  5. Baeza-Yates R, Ribeiro-Neto B (2011) Modern information retrieval, 2nd edn. Addison Wesley, Boston

  6. Bast H, Mortensen CW, Weber I (2008) Output-sensitive autocompletion search. Inf Retr 11(4):269–286

  7. Benoit D, Demaine ED, Munro JI, Raman R, Raman V, Rao SS (2005) Representing trees of higher degree. Algorithmica 43(4):275–292

  8. Boldi P, Codenotti B, Santini M, Vigna S (2004) Ubicrawler: a scalable fully distributed web crawler. Softw Pract Exp 34(8):711–726

  9. Brisaboa NR, Ladra S, Navarro G (2013) DACs: bringing direct access to variable-length codes. Inf Process Manag 49(1):392–404

  10. Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. MIT press, Cambridge

  11. Dundas JA (1991) Implementing dynamic minimal-prefix tries. Softw Pract Exp 21(10):1027–1040

  12. Ferragina P, Grossi R, Gupta A, Shah R, Vitter JS (2008) On searching compressed string collections cache-obliviously. In: Proceedings of the 27th symposium on principles of database systems (PODS), ACM, pp. 181–190

  13. Ferragina P, Luccio F, Manzini G, Muthukrishnan S (2009) Compressing and indexing labeled trees, with applications. J ACM 57(1):4

  14. Fredkin E (1960) Trie memory. Commun ACM 3(9):490–499

  15. Fuketa M, Kitagawa H, Ogawa T, Morita K, Aoe J (2014) Compression of double array structures for fixed length keywords. Inf Process Manag 50(5):796–806

  16. Fuketa M, Morita K, Aoe J (2014) Comparisons of efficient implementations for DAWG. In: Proceedings of the 7th international conference on computer science and information technology (ICCSIT)

  17. González R, Grabowski S, Mäkinen V, Navarro G (2005) Practical implementation of rank and select queries. In: Poster proceedings of the 4th workshop on experimental and efficient algorithms (WEA), pp. 27–38

  18. Grossi R, Ottaviano G (2014) Fast compressed tries through path decompositions. ACM J Exp Algorithm 19(1):3–4

  19. Heaps HS (1978) Information retrieval: computational and theoretical aspects. Academic Press Inc, Orlando

  20. Hu TC, Tucker AC (1971) Optimal computer search trees and variable-length alphabetical codes. SIAM J Appl Math 21(4):514–532

  21. Kanda S, Fuketa M, Morita K, Aoe J (2016) A compression method of double-array structures using linear functions. Knowl Inf Syst 48(1):55–80

  22. Kim DK, Na JC, Kim JE, Park K (2005) Efficient implementation of rank and select functions for succinct representation. In: Proceedings of the 4th international workshop on experimental and efficient algorithms (WEA), LNCS 3503. Springer, New York, pp 315–327

  23. Knuth DE (1998) The art of computer programming, 3: sorting and searching, 2nd edn. Addison Wesley, Redwood City

  24. Kudo T, Hanaoka T, Mukai J, Tabata Y, Komatsu H (2011) Efficient dictionary and language model compression for input method editors. In: Proceedings of the 1st workshop on advances in text input methods (WTIM), pp. 19–25

  25. Kudo T, Yamamoto K, Matsumoto Y (2004) Applying conditional random fields to Japanese morphological analysis. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp. 230–237

  26. Larsson NJ, Moffat A (1999) Offline dictionary-based compression. In: Proceedings of the data compression conference (DCC), pp. 296–305

  27. Maeda A, Mizushima K (2008) A compressed-array representation of automata and its application to programming language (in Japanese). In: Proceedings of the 49th IPSJ programming symposium, pp. 49–54

  28. Martínez-Prieto MA, Brisaboa N, Cánovas R, Claude F, Navarro G (2016) Practical compressed string dictionaries. Inf Syst 56:73–108

  29. Morita K, Fuketa M, Yamakawa Y, Aoe J (2001) Fast insertion methods of a double-array structure. Softw Pract Exp 31(1):43–65

  30. Munro JI, Raman V (2001) Succinct representation of balanced parentheses and static trees. SIAM J Comput 31(3):762–776

  31. Navarro G, Sadakane K (2014) Fully functional static and dynamic succinct trees. ACM Trans Algorithms 10(3):16

  32. Okanohara D, Sadakane K (2007) Practical entropy-compressed rank/select dictionary. In: Proceedings of the 9th meeting on algorithm engineering and experiments (ALENEX), pp. 60–70

  33. Oono M, Atlam ES, Fuketa M, Morita K, Aoe J (2003) A fast and compact elimination method of empty elements from a double-array structure. Softw Pract Exp 33(13):1229–1249

  34. Salomon D (2008) A concise introduction to data compression. Springer, London

  35. Williams HE, Zobel J (1999) Compressing integers for fast file access. Comput J 42(3):193–201

  36. Witten IH, Moffat A, Bell TC (1999) Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, San Francisco

  37. Yasuhara M, Tanaka T, Norimatsu J, Yamamoto M (2013) An efficient language model using double-array structures. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp. 222–232

  38. Yata S, Morita K, Fuketa M, Aoe J (2008) Fast string matching with space-efficient word graphs. In: Proceedings of the 4th international conference on innovations in information technology (IIT), pp. 79–83

  39. Yata S, Oono M, Morita K, Fuketa M, Aoe J (2007) An efficient deletion method for a minimal prefix double array. Softw Pract Exp 37(5):523–534

  40. Yata S, Oono M, Morita K, Fuketa M, Sumitomo T, Aoe J (2007) A compact static double-array keeping character codes. Inf Process Manag 43(1):237–247

  41. Yata S, Oono M, Morita K, Sumitomo T, Aoe J (2006) Double-array compression by pruning twin leaves and unifying common suffixes. In: Proceedings of the 1st international conference on computing and informatics (ICOCI), pp. 1–4

  42. Yoshinaga N, Kitsuregawa M (2014) A self-adaptive classifier for efficient text-stream processing. In: Proceedings of the 24th international conference on computational linguistics (COLING), pp. 1091–1102

  43. Ziv J, Lempel A (1978) Compression of individual sequences via variable-rate coding. IEEE Trans Inf Theory 24(5):530–536

Author information

Corresponding author

Correspondence to Shunsuke Kanda.

About this article

Cite this article

Kanda, S., Morita, K. & Fuketa, M. Compressed double-array tries for string dictionaries supporting fast lookup. Knowl Inf Syst 51, 1023–1042 (2017). https://doi.org/10.1007/s10115-016-0999-8

  • DOI: https://doi.org/10.1007/s10115-016-0999-8
