Abstract
We propose a novel processor-aware compaction technique for pattern matching, which is widely used in databases, information retrieval, and text mining. As the amount of data increases, it is becoming increasingly important to store data efficiently in memory. A compressed suffix array (CSA) is a compact data structure for in-memory pattern matching. However, CSA suffers from severe processor penalties, such as a flood of instructions and cache/TLB misses, because its design is not processor-aware. To mitigate these penalties, we propose a novel compaction technique for CSA, called suffix trie contraction (STC). The frequently accessed suffixes of CSA are transformed into a trie (e.g., a suffix trie), and then inter-connected nodes in the trie are repeatedly 'contracted' into a single node, which enables lightweight sequential scans in a processor-friendly way. In detail, STC consists of two contraction techniques: fixed-length path contraction (FPC) and sub-tree contraction (SC). FPC is applied to the parts of the trie with few branches, and SC is applied to the parts with many branches. Our experimental results indicate that FPC outperforms naive CSA by two orders of magnitude for short pattern queries and by a factor of three for long pattern queries. As the number of branches inside the trie increases, SC gradually becomes superior to CSA and FPC for short pattern queries. Finally, the latency and throughput of STC are 7 times and 72 times better, respectively, than those of CSA on the TREC test data set, at the expense of an additional 7.1% space overhead.
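To make the idea of contracting inter-connected trie nodes concrete, the following is a minimal sketch of generic path contraction on a toy suffix trie. It is not the FPC or SC technique of the paper, and it does not use a compressed suffix array; the names `Node`, `build_suffix_trie`, `contract_paths`, `contains`, and `count_nodes` are hypothetical and chosen only for illustration. It merely shows how chains of single-child nodes can be merged so that a lookup compares a run of characters sequentially instead of following one pointer per character.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Maps an edge label to a child node. After contraction a label
    # may span several characters instead of exactly one.
    children: dict = field(default_factory=dict)


def build_suffix_trie(text):
    """Insert every suffix of `text` one character at a time (toy, quadratic size)."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
    return root


def contract_paths(node):
    """Merge each chain of single-child nodes into one edge with a longer label."""
    merged = {}
    for label, child in node.children.items():
        # Absorb the chain below this edge while it does not branch.
        while len(child.children) == 1:
            (next_label, next_child), = child.children.items()
            label += next_label
            child = next_child
        contract_paths(child)
        merged[label] = child
    node.children = merged


def contains(root, pattern):
    """Return True if `pattern` occurs as a substring of the indexed text."""
    node, pos = root, 0
    while pos < len(pattern):
        rest = pattern[pos:]
        for label, child in node.children.items():
            # Compare the overlapping part of the label and the pattern in one scan.
            common = min(len(label), len(rest))
            if label[:common] == rest[:common]:
                node, pos = child, pos + common
                break
        else:
            return False  # no outgoing edge matches the next characters
    return True


def count_nodes(node):
    """Count the nodes reachable from `node`, including `node` itself."""
    return 1 + sum(count_nodes(child) for child in node.children.values())
```

Under these assumptions, calling `contract_paths(root)` on the trie built from a short string such as `"mississippi"` leaves `contains` answering the same substring queries while `count_nodes(root)` drops, since each non-branching chain becomes a single edge.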
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Yamamuro, T., Onizuka, M., Honjo, T. (2015). Tree Contraction for Compressed Suffix Arrays on Modern Processors. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9050. Springer, Cham. https://doi.org/10.1007/978-3-319-18123-3_22
DOI: https://doi.org/10.1007/978-3-319-18123-3_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18122-6
Online ISBN: 978-3-319-18123-3
eBook Packages: Computer Science (R0)