Abstract
We present TritonSort, a highly efficient, scalable sorting system. It is designed to process large datasets, and has been evaluated against as much as 100TB of input data spread across 832 disks in 52 nodes at a rate of 0.938TB/min. When evaluated against the annual Indy GraySort sorting benchmark, TritonSort is 66% better in absolute performance and has over six times the per-node throughput of the previous record holder. When evaluated against the 100TB Indy JouleSort benchmark, TritonSort sorted 9703 records/Joule. In this article, we describe the hardware and software architecture necessary to operate TritonSort at this level of efficiency. Through careful management of system resources to ensure cross-resource balance, we are able to sort data at approximately 80% of the disks’ aggregate sequential write speed.
We believe the work holds a number of lessons for balanced system design and for scale-out architectures in general. While many interesting systems are able to scale linearly with additional servers, per-server performance can lag behind per-server capacity by more than an order of magnitude. Bridging the gap between high scalability and high performance would either enable significantly less expensive systems that do the same work or make it possible to address significantly larger problem sets with the same infrastructure.
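The headline figures in the abstract can be related to one another with some simple arithmetic. The sketch below is illustrative only and rests on two assumptions that the abstract does not state: that "TB" means a decimal terabyte (10^12 bytes), and that each record is 100 bytes, as in the sort benchmark's record format. The function and field names are hypothetical.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the abstract.
# Assumptions (not stated in the abstract): 1 TB = 10**12 bytes, and
# records are 100 bytes each, as in the sort benchmark's record format.

TB = 10**12  # bytes per decimal terabyte (assumed)


def summarize(data_tb=100, rate_tb_per_min=0.938, nodes=52, disks=832,
              records_per_joule=9703, record_bytes=100):
    """Derive per-node and whole-run quantities from the quoted figures."""
    minutes = data_tb / rate_tb_per_min                 # duration of a 100TB sort
    bytes_per_sec = rate_tb_per_min * TB / 60           # cluster-wide sort rate
    node_mb_per_sec = bytes_per_sec / nodes / 10**6     # sort rate per node
    records = data_tb * TB // record_bytes              # records in the input
    joules = records / records_per_joule                # energy at the JouleSort rate
    return {
        "disks_per_node": disks // nodes,
        "sort_minutes": minutes,
        "node_mb_per_sec": node_mb_per_sec,
        "records": records,
        "energy_megajoules": joules / 10**6,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.1f}" if isinstance(value, float) else f"{name}: {value:,}")
```

Under these assumptions, the cluster has 16 disks per node, a 100TB sort takes a little under 107 minutes, and each node sorts roughly 300 MB of input per second. The energy figure applies only to the JouleSort run, which the abstract reports separately from the 0.938TB/min result, so the two should not be combined into an average power estimate.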