Compressed double-array tries for string dictionaries supporting fast lookup

Kanda, Shunsuke; Morita, Kazuhiro; Fuketa, Masao

doi:10.1007/s10115-016-0999-8

Regular Paper
Published: 04 October 2016

Knowledge and Information Systems, volume 51, pages 1023–1042 (2017)

Shunsuke Kanda¹,
Kazuhiro Morita¹ &
Masao Fuketa¹

Abstract

A string dictionary is a basic tool for storing a set of strings in many kinds of applications. Recently, many applications have come to need space-efficient dictionaries to handle very large datasets. In this paper, we propose new compressed string dictionaries based on improved double-array tries. The double-array trie is a data structure that can implement a string dictionary supporting extremely fast lookup of strings, but its space efficiency is low. We introduce approaches that address this disadvantage. In experimental evaluations, our dictionaries provide faster lookup than state-of-the-art compressed string dictionaries. Moreover, their space efficiency is competitive in many cases.
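
As a rough illustration of why double-array lookup is fast and why its space efficiency tends to be low, the sketch below builds a small static double-array trie in Python. It uses the traditional BASE/CHECK layout with additive transitions. The class name `DoubleArrayTrie`, the NUL terminator, the first-fit slot placement, and the sample keys are all assumptions made for illustration; this is not the compressed construction proposed in the paper.

```python
class DoubleArrayTrie:
    """Minimal static double-array trie (illustrative sketch, not the paper's method)."""

    FREE = -1  # CHECK value marking an unused slot

    def __init__(self, keys):
        # Step 1: build an ordinary pointer-based trie as nested dicts.
        # Each key is terminated with a NUL byte (assumes keys contain no NUL).
        root = {}
        for key in keys:
            node = root
            for byte in key.encode("utf-8") + b"\x00":
                node = node.setdefault(byte, {})

        # Step 2: lay the trie out in two parallel arrays.
        self.base = [0]
        self.check = [0]  # slot 0 is the root, recorded as its own parent
        stack = [(root, 0)]
        while stack:
            node, s = stack.pop()
            if not node:  # leaf reached via the terminator
                continue
            codes = [byte + 1 for byte in node]  # codes >= 1, so slot 0 is never a child
            b = self._find_base(codes)
            self.base[s] = b
            for byte, child in node.items():
                t = b + byte + 1
                self.check[t] = s  # slot t now belongs to parent s
                stack.append((child, t))

    def _find_base(self, codes):
        # First-fit search: smallest base whose child slots are all free.
        b = 0
        while True:
            need = b + max(codes) + 1
            if need > len(self.check):
                grow = need - len(self.check)
                self.base.extend([0] * grow)
                self.check.extend([self.FREE] * grow)
            if all(self.check[b + c] == self.FREE for c in codes):
                return b
            b += 1

    def contains(self, key):
        # One addition and one comparison per byte: no child-list scanning.
        s = 0
        for byte in key.encode("utf-8") + b"\x00":
            t = self.base[s] + byte + 1
            if t >= len(self.check) or self.check[t] != s:
                return False
            s = t
        return True


trie = DoubleArrayTrie(["bird", "bison", "cat"])
print(trie.contains("bird"), trie.contains("bi"))

# Many slots stay unused, which is the space overhead the abstract refers to.
used = sum(c != DoubleArrayTrie.FREE for c in trie.check)
print(f"{used} of {len(trie.check)} slots used")
```

Each lookup step touches only `base[s]` and `check[t]`, which is where the speed comes from, while the count printed at the end shows how sparsely the arrays can be filled.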


Notes

Yet another Part-of-Speech and Morphological Analyzer at http://taku910.github.io/mecab/.
An open-source full-text search engine and column store at http://groonga.org/.
The operator \(\oplus\) denotes XOR (exclusive OR). Traditional implementations use addition (\(+\)), but recent ones such as [42] and Darts-clone (https://github.com/s-yata/darts-clone) often substitute XOR (\(\oplus\)).
https://dumps.wikimedia.org.
http://download.geonames.org/export/dump/allCountries.zip.
http://dist.s-yata.jp/corpus/nwc2010/ngrams/word/over999/filelist.
http://data.law.di.unimi.it/webdata/uk-2005/uk-2005.urls.gz.
http://pizzachili.dcc.uchile.cl/texts/dna/dna.gz.
https://github.com/ot/path_decomposed_tries.
https://github.com/migumar2/libCSD.
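
The note on the \(\oplus\) operator above can be made concrete with a small comparison of the two transition rules. The sketch below is only an illustration: the function names and the sample BASE value and character code are hypothetical, and the remarks about block locality and reversibility are general properties of XOR rather than claims taken from the paper.

```python
def child_plus(base: int, code: int) -> int:
    """Traditional double-array transition: child index = BASE + code."""
    return base + code


def child_xor(base: int, code: int) -> int:
    """XOR-based transition: child index = BASE XOR code."""
    return base ^ code


# Hypothetical values: a BASE entry of 0x1F0 and the byte code of 'a' (0x61).
base, code = 0x1F0, 0x61

t_plus = child_plus(base, code)
t_xor = child_xor(base, code)
print(t_plus, t_xor)

# With codes below 256, XOR only changes the low 8 bits, so the child
# stays in the same 256-slot block as BASE; addition may cross into the next.
print(base >> 8, t_xor >> 8, t_plus >> 8)

# XOR is its own inverse, so BASE can be recovered from the child and the code.
print((t_xor ^ code) == base)
```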

References

Aoe J (1989) An efficient digital search algorithm by using a double-array structure. IEEE Trans Softw Eng 15(9):1066–1077
Aoe J, Morimoto K (1992) An efficient implementation of trie structures. Softw Pract Exp 22(9):695–721
Arroyuelo D, Cánovas R, Navarro G, Sadakane K (2010) Succinct trees in practice. In: Proceedings of the 11th meeting on algorithm engineering and experimentation (ALENEX), pp. 84–97
Arz J, Fischer J (2014) LZ-compressed string dictionaries. In: Proceedings of the data compression conference (DCC), pp. 322–331
Baeza-Yates R, Ribeiro-Neto B (2011) Modern information retrieval, 2nd edn. Addison Wesley, Boston
Bast H, Mortensen CW, Weber I (2008) Output-sensitive autocompletion search. Inf Retr 11(4):269–286
Benoit D, Demaine ED, Munro JI, Raman R, Raman V, Rao SS (2005) Representing trees of higher degree. Algorithmica 43(4):275–292
Boldi P, Codenotti B, Santini M, Vigna S (2004) Ubicrawler: a scalable fully distributed web crawler. Softw Pract Exp 34(8):711–726
Brisaboa NR, Ladra S, Navarro G (2013) DACs: bringing direct access to variable-length codes. Inf Process Manag 49(1):392–404
Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction to algorithms, 3rd edn. MIT press, Cambridge
Dundas JA (1991) Implementing dynamic minimal-prefix tries. Softw Pract Exp 21(10):1027–1040
Ferragina P, Grossi R, Gupta A, Shah R, Vitter JS (2008) On searching compressed string collections cache-obliviously. In: Proceedings of the 27th symposium on principles of database systems (PODS), ACM, pp. 181–190
Ferragina P, Luccio F, Manzini G, Muthukrishnan S (2009) Compressing and indexing labeled trees, with applications. J ACM 57(1):4
Fredkin E (1960) Trie memory. Commun ACM 3(9):490–499
Fuketa M, Kitagawa H, Ogawa T, Morita K, Aoe J (2014) Compression of double array structures for fixed length keywords. Inf Process Manag 50(5):796–806
Fuketa M, Morita K, Aoe J (2014) Comparisons of efficient implementations for DAWG. In: Proceedings of the 7th international conference on computer science and information technology (ICCSIT)
González R, Grabowski S, Mäkinen V, Navarro G (2005) Practical implementation of rank and select queries. In: Poster proceedings of the 4th workshop on experimental and efficient algorithms (WEA), pp. 27–38
Grossi R, Ottaviano G (2014) Fast compressed tries through path decompositions. ACM J Exp Algorithm 19(1):3–4
Heaps HS (1978) Information retrieval: computational and theoretical aspects. Academic Press Inc, Orlando
Hu TC, Tucker AC (1971) Optimal computer search trees and variable-length alphabetical codes. SIAM J Appl Math 21(4):514–532
Kanda S, Fuketa M, Morita K, Aoe J (2016) A compression method of double-array structures using linear functions. Knowl Inf Syst 48(1):55–80
Kim DK, Na JC, Kim JE, Park K (2005) Efficient implementation of rank and select functions for succinct representation. Proceedings of the 4th international workshop on experimental and efficient algorithms (WEA), LNCS 3503. Springer, New York, pp 315–327
Knuth DE (1998) The art of computer programming, 3: sorting and searching, 2nd edn. Addison Wesley, Redwood City
Kudo T, Hanaoka T, Mukai J, Tabata Y, Komatsu H (2011) Efficient dictionary and language model compression for input method editors. In: Proceedings of the 1st workshop on advances in text input methods (WTIM), pp. 19–25
Kudo T, Yamamoto K, Matsumoto Y (2004) Applying conditional random fields to Japanese morphological analysis. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp. 230–237
Larsson NJ, Moffat A (1999) Offline dictionary-based compression. In: Proceedings of the data compression conference (DCC), pp. 296–305
Maeda A, Mizushima K (2008) A compressed-array representation of automata and its application to programming language (in Japanese). In: Proceedings of the 49th IPSJ programming symposium, pp. 49–54
Martínez-Prieto MA, Brisaboa N, Cánovas R, Claude F, Navarro G (2016) Practical compressed string dictionaries. Inf Syst 56:73–108
Morita K, Fuketa M, Yamakawa Y, Aoe J (2001) Fast insertion methods of a double-array structure. Softw Pract Exp 31(1):43–65
Munro JI, Raman V (2001) Succinct representation of balanced parentheses and static trees. SIAM J Comput 31(3):762–776
Navarro G, Sadakane K (2014) Fully functional static and dynamic succinct trees. ACM Trans Algorithms 10(3):16
Okanohara D, Sadakane K (2007) Practical entropy-compressed rank/select dictionary. In: Proceedings of the 9th meeting on algorithm engineering and experiments (ALENEX), pp. 60–70
Oono M, Atlam ES, Fuketa M, Morita K, Aoe J (2003) A fast and compact elimination method of empty elements from a double-array structure. Softw Pract Exp 33(13):1229–1249
Salomon D (2008) A concise introduction to data compression. Springer, London
Williams HE, Zobel J (1999) Compressing integers for fast file access. Comput J 42(3):193–201
Witten IH, Moffat A, Bell TC (1999) Managing gigabytes: compressing and indexing documents and images. Morgan Kaufmann, San Francisco
Yasuhara M, Tanaka T, Norimatsu J, Yamamoto M (2013) An efficient language model using double-array structures. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp. 222–232
Yata S, Morita K, Fuketa M, Aoe J (2008) Fast string matching with space-efficient word graphs. In: Proceedings of the 4th international conference on innovations in information technology (IIT), pp. 79–83
Yata S, Oono M, Morita K, Fuketa M, Aoe J (2007) An efficient deletion method for a minimal prefix double array. Softw Pract Exp 37(5):523–534
Yata S, Oono M, Morita K, Fuketa M, Sumitomo T, Aoe J (2007) A compact static double-array keeping character codes. Inf Process Manag 43(1):237–247
Yata S, Oono M, Morita K, Sumitomo T, Aoe J (2006) Double-array compression by pruning twin leaves and unifying common suffixes. In: Proceedings of the 1st international conference on computing and informatics (ICOCI), pp. 1–4
Yoshinaga N, Kitsuregawa M (2014) A self-adaptive classifier for efficient text-stream processing. In: Proceedings of the 24th international conference on computational linguistics (COLING), pp. 1091–1102
Ziv J, Lempel A (1978) Compression of individual sequences via variable-rate coding. IEEE Trans Inf Theory 24(5):530–536

Author information

Authors and Affiliations

Department of Information Science and Intelligent Systems, Tokushima University, Minamijosanjima 2-1, Tokushima, 770-8506, Japan
Shunsuke Kanda, Kazuhiro Morita & Masao Fuketa

Corresponding author

Correspondence to Shunsuke Kanda.

About this article

Cite this article

Kanda, S., Morita, K. & Fuketa, M. Compressed double-array tries for string dictionaries supporting fast lookup. Knowl Inf Syst 51, 1023–1042 (2017). https://doi.org/10.1007/s10115-016-0999-8

Received: 04 June 2016
Revised: 31 August 2016
Accepted: 26 September 2016
Published: 04 October 2016
Issue Date: June 2017
DOI: https://doi.org/10.1007/s10115-016-0999-8
