Using Random Sampling to Build Approximate Tries for Efficient String Sorting

  • Conference paper
Experimental and Efficient Algorithms (WEA 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3059)

Abstract

Algorithms for sorting large datasets can be made more efficient with careful use of memory hierarchies and reduction in the number of costly memory accesses. In earlier work, we introduced burstsort, a new string sorting algorithm that on large sets of strings is almost twice as fast as previous algorithms, primarily because it is more cache-efficient. The approach in burstsort is to dynamically build a small trie that is used to rapidly allocate each string to a bucket. In this paper, we introduce new variants of our algorithm: SR-burstsort, DR-burstsort, and DRL-burstsort. These algorithms use a random sample of the strings to construct an approximation to the trie prior to sorting. Our experimental results with sets of over 30 million strings show that the new variants reduce cache misses further than did the original burstsort, by up to 37%, while simultaneously reducing instruction counts by up to 24%. In pathological cases, even further savings can be obtained.
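The sampling idea described above can be illustrated with a small sketch. This is not the paper's SR-, DR-, or DRL-burstsort, whose details are not given in the abstract; it is a simplified illustration that assumes a trie built once from a random sample and then left fixed while all strings are distributed into buckets. The function names (`build_trie`, `distribute`, `collect`, `sampled_trie_sort`) and the parameters `sample_size` and `threshold` are hypothetical, chosen only for this example.

```python
import random

def build_trie(sample, depth, threshold):
    """Build an approximate trie from sampled strings (illustrative only).

    A node is a dict mapping a character to either a child node (dict)
    or an empty bucket (list). The key "" collects strings that end here.
    """
    groups = {}
    for s in sample:
        key = s[depth] if depth < len(s) else ""
        groups.setdefault(key, []).append(s)
    node = {}
    for key, group in groups.items():
        if key != "" and len(group) > threshold:
            # Many sampled strings share this prefix: refine it further.
            node[key] = build_trie(group, depth + 1, threshold)
        else:
            node[key] = []  # bucket, filled during distribution
    return node

def distribute(trie, s):
    """Walk the fixed trie and append s to the bucket it reaches."""
    node, depth = trie, 0
    while True:
        key = s[depth] if depth < len(s) else ""
        child = node.get(key)
        if child is None:
            child = node[key] = []  # character never seen in the sample
        if isinstance(child, list):
            child.append(s)
            return
        node, depth = child, depth + 1

def collect(node, out):
    """Traverse the trie in character order, sorting each bucket."""
    for key in sorted(node):  # "" sorts before every character
        child = node[key]
        if isinstance(child, list):
            child.sort()
            out.extend(child)
        else:
            collect(child, out)
    return out

def sampled_trie_sort(strings, sample_size=1000, threshold=8, seed=0):
    """Sort strings using a trie built from a random sample (assumed design)."""
    rng = random.Random(seed)
    sample = rng.sample(strings, min(sample_size, len(strings)))
    trie = build_trie(sample, 0, threshold)
    for s in strings:
        distribute(trie, s)
    return collect(trie, [])
```

In this sketch the trie never changes shape after the sampling phase, so a poorly chosen sample only produces larger buckets, not incorrect output; the result always matches an ordinary lexicographic sort.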

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sinha, R., Zobel, J. (2004). Using Random Sampling to Build Approximate Tries for Efficient String Sorting. In: Ribeiro, C.C., Martins, S.L. (eds) Experimental and Efficient Algorithms. WEA 2004. Lecture Notes in Computer Science, vol 3059. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24838-5_39

  • DOI: https://doi.org/10.1007/978-3-540-24838-5_39

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22067-1

  • Online ISBN: 978-3-540-24838-5

  • eBook Packages: Springer Book Archive
