DOI: 10.1145/3448016.3457319
research-article

Efficient String Sort with Multi-Character Encoding and Adaptive Sampling

Published: 18 June 2021

ABSTRACT

Sorting plays a fundamental role in computer science, with far-reaching applications in database operations and data science tasks. Strings are an important class of sorting keys, and among string sorting methods, radix sort is simple but effective. Much work has gone into accelerating radix string sort. One typical approach is to process multiple characters in each sorting pass, but this makes the radix too large. To address the problem, we introduce a novel multi-character encoding method that significantly reduces the radix. The encoding exploits both the sparse use of the alphabet space and the sparsity of distinguishing prefixes, two properties commonly seen in real-world datasets. By combining this encoding scheme with an adaptive sampling process that generates the encoding efficiently, our sorting algorithm essentially blends radix sort with sample sort and achieves substantial improvement over other sorting approaches. Results on both real and synthetic datasets show that our method yields an average 4.85× performance improvement over C++ STL sort [21], 1.47× over the state-of-the-art radix sort implementation for strings [19], and 2.55× over multikey quicksort [6]. Preliminary tests in a multi-core environment also show that it is competitive with or better than pS5 [8], the most recent parallel string sorting algorithm, which demonstrates the scalability of our method.
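The abstract does not spell out the encoding or the sampling procedure, so the following is only an illustrative sketch of the general idea of blending radix sort with sample sort, not the authors' algorithm. It assumes a fixed chunk width `k`, a fixed sample size, and a plain fallback to the built-in sort for small groups; the function name `sampled_multichar_sort` and all of its parameters are hypothetical. Each pass reads `k` characters at once, but instead of allocating one bucket per possible chunk, it assigns bucket numbers only to chunks seen in a random sample, with one extra bucket for each gap between sampled chunks.

```python
import random
from bisect import bisect_left


def sampled_multichar_sort(keys, k=2, sample_size=64, cutoff=16, seed=0):
    """Sort str or bytes keys, consuming up to k characters per pass.

    Hypothetical sketch: bucket codes come from a random sample of
    k-character chunks rather than from the full alphabet**k space.
    """
    rng = random.Random(seed)

    def sort_from(group, depth):
        # Small groups: fall back to the built-in comparison sort.
        if len(group) <= cutoff:
            return sorted(group)
        # Keys that end at or before `depth` sort ahead of longer keys.
        finished = [s for s in group if len(s) <= depth]
        active = [s for s in group if len(s) > depth]
        if not active:
            return finished
        # Sample chunks; the distinct sampled chunks act as dense codes.
        sample = rng.sample(active, min(sample_size, len(active)))
        codes = sorted({s[depth:depth + k] for s in sample})
        # Odd bucket 2*i+1: chunk equals codes[i].
        # Even bucket 2*i: chunk falls strictly between neighbouring codes.
        buckets = [[] for _ in range(2 * len(codes) + 1)]
        for s in active:
            chunk = s[depth:depth + k]
            i = bisect_left(codes, chunk)
            hit = i < len(codes) and codes[i] == chunk
            buckets[2 * i + 1 if hit else 2 * i].append(s)
        out = finished
        for j, bucket in enumerate(buckets):
            # Equal-chunk buckets advance k characters; gap buckets are
            # re-split at the same depth (they never contain sampled keys,
            # so they are strictly smaller than `active`).
            out.extend(sort_from(bucket, depth + k if j % 2 else depth))
        return out

    return sort_from(list(keys), 0)
```

With byte keys and `k=2`, a direct two-character radix would need 256² = 65,536 buckets per pass, whereas this sketch uses at most `2 * sample_size + 1` (129 with the defaults). The binary search per key is a simplification chosen for brevity; it is not claimed to match the cost profile of the method described in the abstract.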

Supplemental Material

3448016.3457319.mp4 (mp4, 168.8 MB)

References

  1. AMCE. 2021. github. https://github.com/amce2021/AMCE2021
  2. Arne Andersson and Stefan Nilsson. 1994. A New Efficient Radix Sort. In Proceedings 35th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, Santa Fe, NM, US, 714–721. https://doi.org/10.1109/SFCS.1994.365721
  3. Arne Andersson and Stefan Nilsson. 1998. Implementing Radixsort. ACM Journal of Experimental Algorithmics (JEA), Vol. 3 (1998), 7–23. https://dl.acm.org/doi/10.1145/297096.297136
  4. Jörg Arndt. 2010. Mixed radix numbers. In Matters Computational. Springer, Berlin, Heidelberg, 217–231. https://doi.org/10.1007/978-3-642-14764-7_9
  5. Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. 2017. In-place Parallel Super Scalar Sample sort (IPS4o). In Proceedings of European Symposium on Algorithms (ESA). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 9:1–9:14. https://doi.org/10.4230/LIPIcs.ESA.2017.9
  6. Jon L. Bentley and Robert Sedgewick. 1997. Fast Algorithms for Sorting and Searching Strings. In Proceedings of the 8th ACM-SIAM Symposium on Discrete Algorithms (SODA), ACM (Ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, United States, 360–369. https://dl.acm.org/doi/10.5555/314161.314321
  7. Timo Bingmann. 2018. Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools. Ph.D. Dissertation. Karlsruhe Institute of Technology, Karlsruhe, Germany.
  8. Timo Bingmann, Andreas Eberle, and Peter Sanders. 2017. Engineering Parallel String Sorting. Algorithmica, Vol. 77, 1 (Jan. 2017), 235–286. https://doi.org/10.1007/s00453-015-0071-1
  9. Timo Bingmann and Peter Sanders. 2013. Parallel String Sample Sort. In Proceedings of European Symposium on Algorithms (ESA). Springer-Verlag, Sophia Antipolis, France, 169–180. https://link.springer.com/chapter/10.1007/978-3-642-40450-4_15
  10. Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Yan Gu, and Julian Shun. 2015. Sorting with Asymmetric Read and Write Costs. In The 31st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '15). ACM, Portland, Oregon, USA, 1–12. https://doi.org/10.1145/2755573.2755604
  11. Minsik Cho, Daniel Brand, Rajesh Bordawekar, Ulrich Finkler, Vincent Kulandaisamy, and Ruchir Puri. 2015. PARADIS: An Efficient Parallel Algorithm for In-place Radix Sort. Proceedings of the VLDB Endowment, Vol. 8, 12 (2015), 1518–1529. https://doi.org/10.14778/2824032.2824050
  12. Daniel Jiménez-González, Juan J. Navarro, and Josep-Lluis Larriba-Pey. 2003. CC-Radix: a Cache Conscious Sorting Based on Radix Sort. In 11th IEEE Conference on Parallel, Distributed and Network-Based Processing (Euro-PDP'03). IEEE, Genova, Italy, 101–108. https://doi.org/10.1109/EMPDP.2003.1183573
  13. Address dataset. 2020. OpenAddresses. https://data.openaddresses.io/
  14. URL dataset. 2013. panthema. http://panthema.net/2013/parallel-string-sorting/
  15. Gianni Franceschini and S. Muthu Muthukrishnan. 2007. Radix sorting with no extra space. In Proceedings of European Symposium on Algorithms (ESA). Springer-Verlag, Eilat, Israel, 194–205. https://doi.org/10.1007/978-3-540-75520-3_19
  16. GNU. 2009. C++: STL sort. https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.4/a01347.html
  17. Yan Gu, Yihan Sun, and Guy E. Blelloch. 2018. Algorithmic Building Blocks for Asymmetric Memories. In Proceedings of European Symposium on Algorithms (ESA). Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Helsinki, Finland, 44:1–44:15. https://doi.org/10.4230/LIPIcs.ESA.2018.44
  18. Ani Kristo, Kapil Vaidya, Ugur Çetintemel, Sanchit Misra, and Tim Kraska. 2020. The Case for a Learned Sorting Algorithm. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. ACM, Portland, OR, USA, 1001–1016. https://dl.acm.org/doi/10.1145/3318464.3389752
  19. Juha Kärkkäinen and Tommi Rantala. 2008. Engineering Radix Sort for Strings. In Proceedings of the 15th International Symposium on String Processing and Information Retrieval (LNCS), Amihood Amir and Andrew Turpin (Eds.), Vol. 5280. Springer-Verlag, Melbourne, Australia, 3–14. https://doi.org/10.1007/978-3-540-89097-3_3
  20. Peter M. McIlroy, Keith Bostic, and M. Douglas McIlroy. 1993. Engineering Radix Sort. Computing Systems, Vol. 6, 1 (1993), 5–27. https://www.usenix.org/legacy/publications/compsystems/1993/win_mcilroy.pdf
  21. David R Musser. 1997. Introspective sorting and selection algorithms. Software: Practice and Experience, Vol. 27, 8 (Aug. 1997), 983–993. http://oucsace.cs.ohio.edu/ razvan/courses/cs4040/introsort.pdf
  22. Waihong Ng and Katsuhiko Kakehi. 2007. Cache Efficient Radix Sort for String Sorting. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 2 (2007), 457–466. https://dl.acm.org/doi/10.5555/1226834.1226858
  23. Waihong Ng and Katsuhiko Kakehi. 2008. Merging String Sequences by Longest Common Prefixes. IPSJ Digital Courier, Vol. 4 (2008), 69–78. https://doi.org/10.2197/ipsjdc.4.69
  24. Omar Obeya, Endrias Kahssay, Edward Fan, and Julian Shun. 2019. Theoretically-Efficient and Practical Parallel In-Place Radix Sorting. In The 31st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '19). ACM, Phoenix, AZ, USA, 213–224. https://dl.acm.org/doi/10.1145/3323165.3323198
  25. Robert Paige and Robert E. Tarjan. 1987. Three Partition Refinement Algorithms. SIAM J. Comput., Vol. 16, 6 (1987), 973–989. https://doi.org/10.1137/0216062
  26. Mixed radix. 2019. Wikipedia. https://en.wikipedia.org/wiki/Mixed_radix
  27. Naila Rahman and Rajeev Raman. 2001. Adapting Radix Sort to the Memory Hierarchy. ACM Journal of Experimental Algorithmics (JEA), Vol. 6 (2001), 7–37. https://doi.org/10.1145/945394.945401
  28. Peter Sanders and Sebastian Winkel. 2004. Super scalar sample sort. In Proceedings of European Symposium on Algorithms (ESA). Springer-Verlag, Bergen, Norway, 784–796. https://doi.org/10.1007/978-3-540-30140-0_69
  29. Ranjan Sinha and Anthony Wirth. 2010. Engineering Burstsort: Toward Fast In-Place String Sorting. ACM Journal of Experimental Algorithmics (JEA), Vol. 15 (2010), 1–24. https://doi.org/10.1145/1671970.1671978
  30. Ranjan Sinha and Justin Zobel. 2003a. Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries. In 5th Workshop on Algorithm Engineering and Experiments (ALENEX). SIAM, Baltimore, MD, US, 93–105. https://archive.siam.org/meetings/alenex03/Abstracts/rsinha.pdf
  31. Ranjan Sinha and Justin Zobel. 2003b. Efficient Trie-based Sorting of Large Sets of Strings. In Proceedings of the 26th Australasian Computer Science Conference (ACSC). Australian Computer Society, Darlinghurst, NSW, Australia, 11–18. https://dl.acm.org/doi/10.5555/783106.783108
  32. Ranjan Sinha and Justin Zobel. 2004. Using Random Sampling to Build Approximate Tries for Efficient String Sorting. In Proceedings of the 3rd International Workshop on Experimental and Efficient Algorithms (WEA). Springer-Verlag, Angra dos Reis, Brazil, 529–544. https://doi.org/10.1007/978-3-540-24838-5_39
  33. Malte Skarupke. 2016. I Wrote a Faster Sorting Algorithm. Retrieved Jan, 2021 from https://probablydance.com/2016/12/27/i-wrote-a-faster-sorting-algorithm/
  34. Radix sort. 2021. Wikipedia. https://en.wikipedia.org/wiki/Radix_sort
  35. Kurt Thearling and Stephen Smith. 1992. An Improved Supercomputer Sorting Benchmark. In Proceedings Supercomputing '92. IEEE Computer Society, Minneapolis, MN, USA, 14–19. https://doi.org/10.1109/SUPERC.1992.236714
  36. Words (Unix). 2021. Wikipedia. https://en.wikipedia.org/wiki/Words_(Unix)
  37. Jan Wassenberg and Peter Sanders. 2011. Engineering a Multi-core Radix Sort. In Proceedings of the 17th international conference on Parallel processing (Euro-Par). Springer-Verlag, Bordeaux, France, 160–169. https://doi.org/10.1007/978-3-642-23397-5_16

      • Published in

        SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
        June 2021
        2969 pages
        ISBN:9781450383431
        DOI:10.1145/3448016

        Copyright © 2021 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 785 of 4,003 submissions, 20%
