ABSTRACT
Sorting plays a fundamental role in computer science, with far-reaching applications in database operations and data science tasks. Strings are an important class of sorting keys, and among string sorting methods, radix sort is a simple but effective algorithm. Many studies have sought to accelerate radix string sort. One typical approach is to process multiple characters in each sorting pass. However, this approach has a critical drawback: the radix becomes too large. To address this problem, we introduce a novel method based on multi-character encoding that significantly reduces the radix. The encoding scheme exploits two properties commonly seen in real-world datasets: sparse use of the alphabet space and sparsity of the inputs' distinguishing prefixes. By combining this encoding scheme with an adaptive sampling process that generates the encoding efficiently, our proposed sorting algorithm essentially blends radix sort with sample sort and achieves substantial improvement over other sorting approaches. Results on both real and synthetic datasets show that our method yields an average 4.85× performance improvement over C++ STL sort[21], a 1.47× improvement over the state-of-the-art radix sort implementation for strings[19], and a 2.55× improvement over multikey quicksort[6]. Preliminary tests in a multi-core environment also show that it is competitive with or better than the most recent parallel string sorting algorithm, pS5[8], which demonstrates the scalability of our method.
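To make the radix problem concrete, here is a minimal sketch, not the paper's encoding or sampling procedure. It assumes a simple stand-in: each pass consumes `k` characters and assigns dense integer codes only to the chunks that actually occur, so the number of buckets equals the number of distinct chunks rather than 256^k. The names `dense_codes` and `msd_sort` are hypothetical, and the codes are built from the full input instead of a sample.

```python
def dense_codes(strings, depth, k=2):
    """Map each distinct k-character chunk at offset `depth` to a small integer.

    Codes follow the sorted order of the chunks, so bucket order is sort order.
    """
    chunks = sorted({s[depth:depth + k] for s in strings})
    return {chunk: code for code, chunk in enumerate(chunks)}


def msd_sort(strings, depth=0, k=2):
    """MSD radix sort that consumes k characters per pass via dense codes."""
    if len(strings) <= 1:
        return list(strings)
    codes = dense_codes(strings, depth, k)
    # One bucket per chunk in use: the effective radix of this pass.
    buckets = [[] for _ in codes]
    for s in strings:
        buckets[codes[s[depth:depth + k]]].append(s)
    out = []
    for chunk, bucket in zip(sorted(codes), buckets):
        if len(chunk) < k:
            # The strings ended inside this chunk, so they are all equal.
            out.extend(bucket)
        else:
            out.extend(msd_sort(bucket, depth + k, k))
    return out
```

With `k=2` on byte strings, a naive pass would need 65,536 buckets. In this sketch the first pass over a list of English words needs only as many buckets as there are distinct two-character prefixes in the input.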
References
- AMCE. 2021. github. https://github.com/amce2021/AMCE2021
- Arne Andersson and Stefan Nilsson. 1994. A New Efficient Radix Sort. In Proceedings 35th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, Santa Fe, NM, US, 714--721. https://doi.org/10.1109/SFCS.1994.365721
- Arne Andersson and Stefan Nilsson. 1998. Implementing Radixsort. ACM Journal of Experimental Algorithmics (JEA), Vol. 3 (1998), 7--23. https://dl.acm.org/doi/10.1145/297096.297136
- Jörg Arndt. 2010. Mixed radix numbers. In Matters Computational. Springer, Berlin, Heidelberg, 217--231. https://doi.org/10.1007/978-3-642-14764-7_9
- Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. 2017. In-place Parallel Super Scalar Sample sort (IPS4o). In Proceedings of European Symposium on Algorithms (ESA). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 9:1--9:14. https://doi.org/10.4230/LIPIcs.ESA.2017.9
- Jon L. Bentley and Robert Sedgewick. 1997. Fast Algorithms for Sorting and Searching Strings. In Proceedings of the 8th ACM-SIAM Symposium on Discrete Algorithms (SODA), ACM (Ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA, United States, 360--369. https://dl.acm.org/doi/10.5555/314161.314321
- Timo Bingmann. 2018. Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools. Ph.D. Dissertation. Karlsruhe Institute of Technology, Karlsruhe, Germany.
- Timo Bingmann, Andreas Eberle, and Peter Sanders. 2017. Engineering Parallel String Sorting. Algorithmica, Vol. 77, 1 (Jan. 2017), 235--286. https://doi.org/10.1007/s00453-015-0071-1
- Timo Bingmann and Peter Sanders. 2013. Parallel String Sample Sort. In Proceedings of European Symposium on Algorithms (ESA). Springer-Verlag, Sophia Antipolis, France, 169--180. https://link.springer.com/chapter/10.1007/978-3-642-40450-4_15
- Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Yan Gu, and Julian Shun. 2015. Sorting with Asymmetric Read and Write Costs. In The 27th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '15). ACM, Portland, Oregon, USA, 1--12. https://doi.org/10.1145/2755573.2755604
- Minsik Cho, Daniel Brand, Rajesh Bordawekar, Ulrich Finkler, Vincent Kulandaisamy, and Ruchir Puri. 2015. PARADIS: An Efficient Parallel Algorithm for In-place Radix Sort. Proceedings of the VLDB Endowment, Vol. 8, 12 (2015), 1518--1529. https://doi.org/10.14778/2824032.2824050
- Daniel Jiménez-González, Juan J. Navarro, and Josep Luis Larriba-Pey. 2003. CC-Radix: a Cache Conscious Sorting Based on Radix Sort. In 11th IEEE Conference on Parallel, Distributed and Network-Based Processing (Euro-PDP'03). IEEE, Genova, Italy, 101--108. https://doi.org/10.1109/EMPDP.2003.1183573
- Address dataset. 2020. OpenAddresses. https://data.openaddresses.io/
- URL dataset. 2013. panthema. http://panthema.net/2013/parallel-string-sorting/
- Gianni Franceschini and S. Muthu Muthukrishnan. 2007. Radix sorting with no extra space. In Proceedings of European Symposium on Algorithms (ESA). Springer-Verlag, Eilat, Israel, 194--205. https://doi.org/10.1007/978-3-540-75520-3_19
- GNU. 2009. C++: STL sort. https://gcc.gnu.org/onlinedocs/libstdc++/libstdc++-html-USERS-4.4/a01347.html
- Yan Gu, Yihan Sun, and Guy E. Blelloch. 2018. Algorithmic Building Blocks for Asymmetric Memories. In Proceedings of European Symposium on Algorithms (ESA). Schloss Dagstuhl - Leibniz-Zentrum für Informatik, Helsinki, Finland, 44:1--44:15. https://doi.org/10.4230/LIPIcs.ESA.2018.44
- Ani Kristo, Kapil Vaidya, Ugur Çetintemel, Sanchit Misra, and Tim Kraska. 2020. The Case for a Learned Sorting Algorithm. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. ACM, Portland, OR, USA, 1001--1016. https://dl.acm.org/doi/10.1145/3318464.3389752
- Juha Kärkkäinen and Tommi Rantala. 2008. Engineering Radix Sort for Strings. In Proceedings of the 15th International Symposium on String Processing and Information Retrieval (LNCS), Amihood Amir and Andrew Turpin (Eds.), Vol. 5280. Springer-Verlag, Melbourne, Australia, 3--14. https://doi.org/10.1007/978-3-540-89097-3_3
- Peter M. McIlroy, Keith Bostic, and M. Douglas McIlroy. 1993. Engineering Radix Sort. Computing Systems, Vol. 6, 1 (1993), 5--27. https://www.usenix.org/legacy/publications/compsystems/1993/win_mcilroy.pdf
- David R. Musser. 1997. Introspective sorting and selection algorithms. Software: Practice and Experience, Vol. 27, 8 (Aug. 1997), 983--993. http://oucsace.cs.ohio.edu/ razvan/courses/cs4040/introsort.pdf
- Waihong Ng and Katsuhiko Kakehi. 2007. Cache Efficient Radix Sort for String Sorting. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 2 (2007), 457--466. https://dl.acm.org/doi/10.5555/1226834.1226858
- Waihong Ng and Katsuhiko Kakehi. 2008. Merging String Sequences by Longest Common Prefixes. IPSJ Digital Courier, Vol. 4 (2008), 69--78. https://doi.org/10.2197/ipsjdc.4.69
- Omar Obeya, Endrias Kahssay, Edward Fan, and Julian Shun. 2019. Theoretically-Efficient and Practical Parallel In-Place Radix Sorting. In The 31st ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '19). ACM, Phoenix, AZ, USA, 213--224. https://dl.acm.org/doi/10.1145/3323165.3323198
- Robert Paige and Robert E. Tarjan. 1987. Three Partition Refinement Algorithms. SIAM J. Comput., Vol. 16, 6 (1987), 973--989. https://doi.org/10.1137/0216062
- Mixed radix. 2019. Wikipedia. https://en.wikipedia.org/wiki/Mixed_radix
- Naila Rahman and Rajeev Raman. 2001. Adapting Radix Sort to the Memory Hierarchy. ACM Journal of Experimental Algorithmics (JEA), Vol. 6 (2001), 7--37. https://doi.org/10.1145/945394.945401
- Peter Sanders and Sebastian Winkel. 2004. Super scalar sample sort. In Proceedings of European Symposium on Algorithms (ESA). Springer-Verlag, Bergen, Norway, 784--796. https://doi.org/10.1007/978-3-540-30140-0_69
- Ranjan Sinha and Anthony Wirth. 2010. Engineering Burstsort: Toward Fast In-Place String Sorting. ACM Journal of Experimental Algorithmics (JEA), Vol. 15 (2010), 1--24. https://doi.org/10.1145/1671970.1671978
- Ranjan Sinha and Justin Zobel. 2003a. Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries. In 5th Workshop on Algorithm Engineering and Experiments (ALENEX). SIAM, Baltimore, MD, US, 93--105. https://archive.siam.org/meetings/alenex03/Abstracts/rsinha.pdf
- Ranjan Sinha and Justin Zobel. 2003b. Efficient Trie-based Sorting of Large Sets of Strings. In Proceedings of the 26th Australasian Computer Science Conference (ACSC). Australian Computer Society, Darlinghurst, NSW, Australia, 11--18. https://dl.acm.org/doi/10.5555/783106.783108
- Ranjan Sinha and Justin Zobel. 2004. Using Random Sampling to Build Approximate Tries for Efficient String Sorting. In Proceedings of the 3rd International Workshop on Experimental and Efficient Algorithms (WEA). Springer-Verlag, Angra dos Reis, Brazil, 529--544. https://doi.org/10.1007/978-3-540-24838-5_39
- Malte Skarupke. 2016. I Wrote a Faster Sorting Algorithm. Retrieved January 2021 from https://probablydance.com/2016/12/27/i-wrote-a-faster-sorting-algorithm/
- Radix sort. 2021. Wikipedia. https://en.wikipedia.org/wiki/Radix_sort
- Kurt Thearling and Stephen Smith. 1992. An Improved Supercomputer Sorting Benchmark. In Proceedings Supercomputing '92. IEEE Computer Society, Minneapolis, MN, USA, 14--19. https://doi.org/10.1109/SUPERC.1992.236714
- Words (Unix). 2021. Wikipedia. https://en.wikipedia.org/wiki/Words_(Unix)
- Jan Wassenberg and Peter Sanders. 2011. Engineering a Multi-core Radix Sort. In Proceedings of the 17th International Conference on Parallel Processing (Euro-Par). Springer-Verlag, Bordeaux, France, 160--169. https://doi.org/10.1007/978-3-642-23397-5_16
Index Terms
- Efficient String Sort with Multi-Character Encoding and Adaptive Sampling