Abstract
The volume of data and content generated today demands a paradigm shift in how these data are processed and managed. Sorting is one of the most important data processing operations, and extremely large files are normally sorted with external algorithms such as merge sort. Because swapping time dominates in many applications that handle large files, algorithms that minimize swapping operations are normally superior to those that focus only on optimizing CPU time. We show that making multiple passes over the data set in external merge sort, as our algorithm proposes, greatly reduces the number of swaps and therefore the overall sorting time for extremely large data files. The proposed technique is also well suited to emerging parallelization platforms such as GPUs. The reported results show the superiority of the proposed technique in both “CPU only” and hybrid CPU–GPU implementations.
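For readers unfamiliar with external sorting, the following is a minimal sketch of the textbook two-phase external merge sort that the abstract takes as its starting point. It is not the multi-pass algorithm proposed in the paper, whose details the abstract does not give. The function names, the `chunk_size` parameter, and the one-integer-per-line run format are all assumptions made for illustration.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def _write_run(sorted_values: List[int], directory: str) -> str:
    """Write one sorted run to a temporary file, one integer per line."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
    with os.fdopen(fd, "w") as handle:
        handle.writelines(f"{value}\n" for value in sorted_values)
    return path


def _read_run(path: str) -> Iterator[int]:
    """Stream the integers of a run back from disk, one at a time."""
    with open(path) as handle:
        for line in handle:
            yield int(line)


def external_merge_sort(values: Iterable[int], chunk_size: int) -> List[int]:
    """Sort values while building each run from at most chunk_size items.

    Hypothetical helper for illustration only: a real external sort would
    stream the merged output to disk instead of returning a list.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    with tempfile.TemporaryDirectory() as directory:
        run_paths: List[str] = []
        chunk: List[int] = []
        # Phase 1: cut the input into memory-sized chunks, sort each chunk
        # in memory, and spill it to disk as a sorted run.
        for value in values:
            chunk.append(value)
            if len(chunk) == chunk_size:
                run_paths.append(_write_run(sorted(chunk), directory))
                chunk = []
        if chunk:
            run_paths.append(_write_run(sorted(chunk), directory))
        # Phase 2: k-way merge of all runs; heapq.merge keeps only one
        # pending item per run in memory at a time.
        return list(heapq.merge(*(_read_run(path) for path in run_paths)))
```

Calling `external_merge_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3)` builds three runs on disk and merges them into one sorted sequence. In this sketch every item is written once and read once; the number of such disk transfers is the kind of cost the abstract argues should be minimized.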
Notes
The GitHub repository link will be provided here once the paper is accepted.
About this article
Cite this article
Shatnawi, A., AlZahouri, Y., Shehab, M.A. et al. Toward a new approach for sorting extremely large data files in the big data era. Cluster Comput 22, 819–828 (2019). https://doi.org/10.1007/s10586-018-2860-1
DOI: https://doi.org/10.1007/s10586-018-2860-1