Abstract
A string dictionary is a basic tool for storing a set of strings in many kinds of applications. Recently, many applications have come to need space-efficient dictionaries to handle very large datasets. In this paper, we propose new compressed string dictionaries using improved double-array tries. The double-array trie is a data structure that can implement a string dictionary supporting extremely fast lookup of strings, but its space efficiency is low. We introduce approaches for addressing this disadvantage. In experimental evaluations, our dictionaries provide the fastest lookup among the state-of-the-art compressed string dictionaries compared. Moreover, their space efficiency is competitive in many cases.
Notes
Yet another Part-of-Speech and Morphological Analyzer at http://taku910.github.io/mecab/.
An open-source full-text search engine and column store at http://groonga.org/.
Operator \(\oplus \) denotes an XOR (exclusive OR) operation. While traditional implementations use a PLUS (\(+\)), the XOR (\(\oplus \)) is often substituted in recent ones such as [42] and Darts-clone at https://github.com/s-yata/darts-clone.
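To illustrate the XOR-based transition described in the last note, the following is a minimal Python sketch of a static double-array trie in which the child of node s on character c is located at BASE[s] ⊕ c and validated by CHECK. The names `build_double_array`, `contains`, `FREE`, and `NO_PARENT`, the NUL terminator, and the naive search for base values are all assumptions made for illustration; this is not the compressed structure proposed in the paper, nor the Darts-clone implementation.

```python
FREE = -1       # hypothetical marker: slot not used by any node
NO_PARENT = -2  # hypothetical marker: CHECK value of the root


def build_double_array(keys):
    """Build BASE/CHECK lists for a static set of byte-string keys.

    Assumes no key contains the NUL byte, which is used as a terminator.
    The base search is a naive linear scan, chosen for clarity, not speed.
    """
    # Step 1: build an ordinary pointer-based trie (nested dicts).
    trie = {}
    for key in keys:
        node = trie
        for c in key + b"\0":
            node = node.setdefault(c, {})

    # Step 2: place every node into BASE/CHECK; the root sits at index 0.
    base, check = [0], [NO_PARENT]
    stack = [(0, trie)]
    while stack:
        s, node = stack.pop()
        if not node:
            continue  # leaf (reached via the terminator): no children
        b = 0
        while True:
            # Candidate child slots under the XOR transition.
            slots = [b ^ c for c in node]
            need = max(slots) + 1
            if need > len(check):
                grow = need - len(check)
                base.extend([0] * grow)
                check.extend([FREE] * grow)
            if all(check[t] == FREE for t in slots):
                break
            b += 1
        base[s] = b
        for c, child in node.items():
            t = b ^ c
            check[t] = s  # record the parent so lookups can validate
            stack.append((t, child))
    return base, check


def contains(base, check, key):
    """Return True if key was inserted, following BASE[s] XOR c transitions."""
    s = 0
    for c in key + b"\0":
        t = base[s] ^ c
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return True


base, check = build_double_array([b"tea", b"ten", b"to", b"inn"])
print(contains(base, check, b"ten"), contains(base, check, b"te"))
```

Replacing `b ^ c` with `b + c` in both functions gives the traditional PLUS variant. Each lookup step costs one array read for BASE and one for CHECK, which is the source of the fast lookup the abstract refers to.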
References
Aoe J (1989) An efficient digital search algorithm by using a double-array structure. IEEE Trans Softw Eng 15(9):1066–1077
Aoe J, Morimoto K (1992) An efficient implementation of trie structures. Softw Pract Exp 22(9):695–721
Arroyuelo D, Cánovas R, Navarro G, Sadakane K (2010) Succinct trees in practice. In: Proceedings of the 11th meeting on algorithm engineering and experimentation (ALENEX), pp. 84–97
Arz J, Fischer J (2014) LZ-compressed string dictionaries. In: Proceedings of the data compression conference (DCC), pp. 322–331
Baeza-Yates R, Ribeiro-Neto B (2011) Modern information retrieval, 2nd edn. Addison Wesley, Boston
Bast H, Mortensen CW, Weber I (2008) Output-sensitive autocompletion search. Inf Retr 11(4):269–286
Benoit D, Demaine ED, Munro JI, Raman R, Raman V, Rao SS (2005) Representing trees of higher degree. Algorithmica 43(4):275–292
Boldi P, Codenotti B, Santini M, Vigna S (2004) Ubicrawler: a scalable fully distributed web crawler. Softw Pract Exp 34(8):711–726
Brisaboa NR, Ladra S, Navarro G (2013) DACs: bringing direct access to variable-length codes. Inf Process Manag 49(1):392–404
Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. MIT Press, Cambridge
Dundas JA (1991) Implementing dynamic minimal-prefix tries. Softw Pract Exp 21(10):1027–1040
Ferragina P, Grossi R, Gupta A, Shah R, Vitter JS (2008) On searching compressed string collections cache-obliviously. In: Proceedings of the 27th symposium on principles of database systems (PODS), ACM, pp. 181–190
Ferragina P, Luccio F, Manzini G, Muthukrishnan S (2009) Compressing and indexing labeled trees, with applications. J ACM 57(1):4
Fredkin E (1960) Trie memory. Commun ACM 3(9):490–499
Fuketa M, Kitagawa H, Ogawa T, Morita K, Aoe J (2014) Compression of double array structures for fixed length keywords. Inf Process Manag 50(5):796–806
Fuketa M, Morita K, Aoe J (2014) Comparisons of efficient implementations for DAWG. In: Proceedings of the 7th international conference on computer science and information technology (ICCSIT)
González R, Grabowski S, Mäkinen V, Navarro G (2005) Practical implementation of rank and select queries. In: Poster proceedings of the 4th workshop on experimental and efficient algorithms (WEA), pp. 27–38
Grossi R, Ottaviano G (2014) Fast compressed tries through path decompositions. ACM J Exp Algorithm 19(1):3–4
Heaps HS (1978) Information retrieval: computational and theoretical aspects. Academic Press Inc, Orlando
Hu TC, Tucker AC (1971) Optimal computer search trees and variable-length alphabetical codes. SIAM J Appl Math 21(4):514–532
Kanda S, Fuketa M, Morita K, Aoe J (2016) A compression method of double-array structures using linear functions. Knowl Inf Syst 48(1):55–80
Kim DK, Na JC, Kim JE, Park K (2005) Efficient implementation of rank and select functions for succinct representation. In: Proceedings of the 4th international workshop on experimental and efficient algorithms (WEA), LNCS 3503. Springer, New York, pp. 315–327
Knuth DE (1998) The art of computer programming, 3: sorting and searching, 2nd edn. Addison Wesley, Redwood City
Kudo T, Hanaoka T, Mukai J, Tabata Y, Komatsu H (2011) Efficient dictionary and language model compression for input method editors. In: Proceedings of the 1st workshop on advances in text input methods (WTIM), pp. 19–25
Kudo T, Yamamoto K, Matsumoto Y (2004) Applying conditional random fields to Japanese morphological analysis. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp. 230–237
Larsson NJ, Moffat A (1999) Offline dictionary-based compression. In: Proceedings of the data compression conference (DCC), pp. 296–305
Maeda A, Mizushima K (2008) A compressed-array representation of automata and its application to programming language (in Japanese). In: Proceedings of the 49th IPSJ programming symposium, pp. 49–54
Martínez-Prieto MA, Brisaboa N, Cánovas R, Claude F, Navarro G (2016) Practical compressed string dictionaries. Inf Syst 56:73–108
Morita K, Fuketa M, Yamakawa Y, Aoe J (2001) Fast insertion methods of a double-array structure. Softw Pract Exp 31(1):43–65
Munro JI, Raman V (2001) Succinct representation of balanced parentheses and static trees. SIAM J Comput 31(3):762–776
Navarro G, Sadakane K (2014) Fully functional static and dynamic succinct trees. ACM Trans Algorithms 10(3):16
Okanohara D, Sadakane K (2007) Practical entropy-compressed rank/select dictionary. In: Proceedings of the 9th meeting on algorithm engineering and experiments (ALENEX), pp. 60–70
Oono M, Atlam ES, Fuketa M, Morita K, Aoe J (2003) A fast and compact elimination method of empty elements from a double-array structure. Softw Pract Exp 33(13):1229–1249
Salomon D (2008) A concise introduction to data compression. Springer, London
Williams HE, Zobel J (1999) Compressing integers for fast file access. Comput J 42(3):193–201
Witten IH, Moffat A, Bell TC (1999) Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, San Francisco
Yasuhara M, Tanaka T, Norimatsu J, Yamamoto M (2013) An efficient language model using double-array structures. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp. 222–232
Yata S, Morita K, Fuketa M, Aoe J (2008) Fast string matching with space-efficient word graphs. In: Proceedings of the 4th international conference on innovations in information technology (IIT), pp. 79–83
Yata S, Oono M, Morita K, Fuketa M, Aoe J (2007) An efficient deletion method for a minimal prefix double array. Softw Pract Exp 37(5):523–534
Yata S, Oono M, Morita K, Fuketa M, Sumitomo T, Aoe J (2007) A compact static double-array keeping character codes. Inf Process Manag 43(1):237–247
Yata S, Oono M, Morita K, Sumitomo T, Aoe J (2006) Double-array compression by pruning twin leaves and unifying common suffixes. In: Proceedings of the 1st international conference on computing and informatics (ICOCI), pp. 1–4
Yoshinaga N, Kitsuregawa M (2014) A self-adaptive classifier for efficient text-stream processing. In: Proceedings of the 24th international conference on computational linguistics (COLING), pp. 1091–1102
Ziv J, Lempel A (1978) Compression of individual sequences via variable-rate coding. IEEE Trans Inf Theory 24(5):530–536
About this article
Cite this article
Kanda, S., Morita, K. & Fuketa, M. Compressed double-array tries for string dictionaries supporting fast lookup. Knowl Inf Syst 51, 1023–1042 (2017). https://doi.org/10.1007/s10115-016-0999-8
DOI: https://doi.org/10.1007/s10115-016-0999-8