Abstract
A trie is a data structure for keyword search and is used in natural language processing, reserved-word lookup in compilers, and similar applications. The double-array and LOUDS are efficient representations of the trie. The double-array provides fast traversal, with O(1) time per transition, but it uses more space than LOUDS. LOUDS is a succinct data structure based on a bit string, and its space usage is extremely compact; however, its traversal is slower. This paper presents a new method that compresses the double-array while preserving its retrieval speed. The method divides the double-array into blocks and uses linear functions. Experimental results on a variety of keyword sets show that the new method reduced the space usage of the double-array up to about 44 %, and that its retrieval speed was 9–14 times faster than that of LOUDS. Moreover, the results show that, for a large keyword set, construction with the new method was faster than with the conventional method.
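To make the O(1) traversal claim concrete, the following is a minimal sketch of the classic, uncompressed double-array with BASE and CHECK arrays. It is not the block-based, linear-function layout proposed in this paper. The function names, the byte-plus-one character coding, the NUL terminator, and the naive first-fit construction are all assumptions made for illustration.

```python
UNUSED = -1  # marks a free slot in CHECK (assumed convention)


def encode(key):
    """Map a key to positive codes: each UTF-8 byte + 1, plus a NUL terminator."""
    return [b + 1 for b in key.encode("utf-8") + b"\0"]


def build_double_array(keys):
    """Build BASE/CHECK from a pointer-based trie using naive first-fit."""
    # Intermediate trie: each node is a dict mapping a code to a child node.
    root = {}
    for key in keys:
        node = root
        for code in encode(key):
            node = node.setdefault(code, {})

    base, check = [0], [0]  # index 0 is the root and counts as used
    stack = [(0, root)]
    while stack:
        s, node = stack.pop()
        if not node:
            continue  # leaf: no children to place
        codes = sorted(node)
        b = 0
        while True:
            # Grow the arrays so that every candidate slot exists.
            while len(check) <= b + codes[-1]:
                base.append(0)
                check.append(UNUSED)
            if all(check[b + c] == UNUSED for c in codes):
                break
            b += 1
        base[s] = b
        for c in codes:
            check[b + c] = s  # child slot records its parent
            stack.append((b + c, node[c]))
    return base, check


def traverse(base, check, s, code):
    """One parent-to-child transition: one addition and one comparison."""
    t = base[s] + code
    if t < len(check) and check[t] == s:
        return t
    return -1


def contains(base, check, key):
    """Return True if key was registered when the arrays were built."""
    s = 0
    for code in encode(key):
        s = traverse(base, check, s, code)
        if s < 0:
            return False
    return True


base, check = build_double_array(["bird", "bison", "cat"])
print(contains(base, check, "bison"))  # True
print(contains(base, check, "bi"))     # False: a prefix, not a keyword
```

Each call to `traverse` costs a constant number of operations regardless of how many children a node has. The price is one BASE and one CHECK entry per slot, including unused ones, and that is the space overhead a compression scheme has to reduce.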












Notes
The base of logarithm is 2 throughout this paper.
Darts: Double-ARray Trie System. http://chasen.org/~taku/software/darts/.
Darts-clone: A clone of the Darts. https://code.google.com/p/darts-clone/.
ChaSen legacy: an old morphological analyzer. http://chasen-legacy.sourceforge.jp/.
MeCab: Yet Another Part-of-Speech and Morphological Analyzer. http://mecab.googlecode.com/svn/trunk/mecab/doc/index.html.
In this paper, “traversal” in the trie means a transition from a parent node to a child node.
Strictly speaking, the number of blocks is \(\lceil (n+m)/\textit{bsize}\rceil \), but the ceil function is omitted for simplicity. Likewise, the extra space for rank / select operations is calculated in LOUDS.
Tx: Succinct Trie Data structure. https://code.google.com/p/tx-trie/.
WordNet 3.0. http://wordnetcode.princeton.edu/3.0/WordNet-3.0.tar.gz.
jawiki dump progress on 20150118. http://dumps.wikimedia.org/jawiki/20150118/jawiki-20150118-all-titles-in-ns0.gz.
enwiki dump progress on 20150205. http://dumps.wikimedia.org/enwiki/20150205/enwiki-20150205-all-titles-in-ns0.gz.
Cite this article
Kanda, S., Fuketa, M., Morita, K. et al. A compression method of double-array structures using linear functions. Knowl Inf Syst 48, 55–80 (2016). https://doi.org/10.1007/s10115-015-0873-0
DOI: https://doi.org/10.1007/s10115-015-0873-0