Abstract
Algorithms for sorting large datasets can be made more efficient through careful use of the memory hierarchy and by reducing the number of costly memory accesses. In earlier work, we introduced burstsort, a string sorting algorithm that, on large sets of strings, is almost twice as fast as previous algorithms, primarily because it is more cache-efficient. Burstsort dynamically builds a small trie that is used to rapidly allocate each string to a bucket. In this paper, we introduce new variants of our algorithm: SR-burstsort, DR-burstsort, and DRL-burstsort. These algorithms use a random sample of the strings to construct an approximation to the trie prior to sorting. Our experimental results with sets of over 30 million strings show that the new variants reduce cache misses further than the original burstsort did, by up to 37%, while simultaneously reducing instruction counts by up to 24%. In pathological cases, even further savings can be obtained.
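The general idea of building a trie from a random sample and then distributing every string into its buckets can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not reproduce the specific SR-, DR-, or DRL-burstsort variants, whose details are not given in the abstract. The names `TrieNode`, `sampled_burstsort`, `sample_size`, and `burst_limit`, and their default values, are assumptions chosen for illustration. The sketch also says nothing about cache behaviour, which depends on a low-level memory layout that Python does not expose.

```python
import random


class TrieNode:
    """Internal trie node (hypothetical layout, for illustration only)."""

    def __init__(self):
        self.exhausted = []  # strings that end exactly at this node
        self.children = {}   # char -> TrieNode (internal) or list (bucket)


def _insert(node, s, depth, burst_limit):
    """Route s to a bucket; burst overfull buckets if burst_limit is set."""
    while True:
        if depth == len(s):
            node.exhausted.append(s)
            return
        c = s[depth]
        child = node.children.setdefault(c, [])
        if isinstance(child, TrieNode):
            node, depth = child, depth + 1
            continue
        child.append(s)
        if burst_limit is not None and len(child) > burst_limit:
            # Burst: replace the bucket with a node and redistribute.
            new_node = TrieNode()
            node.children[c] = new_node
            for t in child:
                _insert(new_node, t, depth + 1, burst_limit)
        return


def _clear(node):
    """Empty all buckets while keeping the trie structure."""
    node.exhausted = []
    for c, child in node.children.items():
        if isinstance(child, TrieNode):
            _clear(child)
        else:
            node.children[c] = []


def _collect(node, out):
    """In-order traversal: sort each bucket and append it to out."""
    out.extend(node.exhausted)  # all equal to the path prefix
    for c in sorted(node.children):
        child = node.children[c]
        if isinstance(child, TrieNode):
            _collect(child, out)
        else:
            child.sort()
            out.extend(child)


def sampled_burstsort(strings, sample_size=100, burst_limit=4, seed=0):
    """Sort a list of strings using a trie built from a random sample."""
    rng = random.Random(seed)
    sample = rng.sample(strings, min(sample_size, len(strings)))
    root = TrieNode()
    for s in sample:                 # phase 1: build the approximate trie
        _insert(root, s, 0, burst_limit)
    _clear(root)                     # keep the shape, drop the sample
    for s in strings:                # phase 2: distribute, trie is fixed
        _insert(root, s, 0, None)
    out = []
    _collect(root, out)              # phase 3: sort buckets in order
    return out
```

In this sketch the trie shape is fixed after the sampling phase, so the main pass over the data only appends to buckets and never bursts them. A call such as `sampled_burstsort(["banana", "apple", "app"])` returns the strings in the same order as Python's built-in `sorted`.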
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sinha, R., Zobel, J. (2004). Using Random Sampling to Build Approximate Tries for Efficient String Sorting. In: Ribeiro, C.C., Martins, S.L. (eds) Experimental and Efficient Algorithms. WEA 2004. Lecture Notes in Computer Science, vol 3059. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24838-5_39
DOI: https://doi.org/10.1007/978-3-540-24838-5_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22067-1
Online ISBN: 978-3-540-24838-5
eBook Packages: Springer Book Archive