Skip to main content
Log in

Toward a new approach for sorting extremely large data files in the big data era

  • Published:
Cluster Computing Aims and scope Submit manuscript

Abstract

The extensive amount of data and contents generated today will require a paradigm shift in processing and management techniques for these data. One of the important data processing operations is the data sorting. Using multiple passes in external merge sort has a great influence on speeding up the sorting of extremely large data files. Since in large files, the swapping time is dominant in many applications, algorithms that minimize the swapping operations are normally superior to those which only focus on CPU time optimizations. In sorting extremely large files, external algorithms, such as the merge sort, are normally used. It is shown that using multiple passes over the data set, as proposed in our algorithm, has resulted in a great improvement in the number of swaps, thus, reducing the overall sorting time. Moreover, the proposed technique is suitable to be used with the emerging parallelization techniques such as GPUs. The reported results show the superiority of the proposed technique for “CPU only” and hybrid CPU–GPU implementations.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11

Similar content being viewed by others

Notes

  1. The GitHub repository link will be provided here once the paper is accepted.

References

  1. Bitton, D., DeWitt, D.J., Hsaio, D.K., Menon, J.: A taxonomy of parallel sorting. ACM Comput. Surv. 16(3), 287–318 (1984)

    Article  MathSciNet  MATH  Google Scholar 

  2. Knuth, D.E.: The Art of Computer Programming, vol 3. Sorting and Searching, 2nd edn. Addison Wesley, Massachusetts (1998)

  3. Graefe, G.: Query evaluation techniques for large databases. ACM Comput. Surv. 25(2), 73–170 (1993)

    Article  Google Scholar 

  4. John, L.H., David, A.P.: Computer Organization and Design (3rd): The Hardware/Software Interface. Morgan Kaufmann Publishers Inc, San Francisco, CA (2004)

    Google Scholar 

  5. Peter, J.D.: Virtual memory. ACM Comput. Surv. 2(3), 153–189 (1970)

    Article  MATH  Google Scholar 

  6. Shatnawi, A., Alzahouri, Y.: A multi-pass algorithm for sorting extremely large data files. In: 2015 6th International Conference on Information and Communication Systems (ICICS), pp. 79–82. IEEE (2015)

  7. Manber, U.: Introduction to Algorithms: A Creative Approach. Addison-Wesley, Reading, MA (1989)

    MATH  Google Scholar 

  8. Thomas, H.C., Charles, E.L., Ronald, L.R.: Introduction to Algorithms. McGraw-Hill, New York (1989)

    Google Scholar 

  9. Leu, F.C., Tsai, Y.T., Tang, C.Y.: An efficient external sorting algorithm. Inf. Process. Lett. 75(4), 159–163 (2000)

    Article  MathSciNet  MATH  Google Scholar 

  10. Shehab, M.A., Yaseen, Q., Al-Ayyoub, M., Albalas, F., Jararweh, Y.: Accelerating FCM-based text classification algorithm using GPUs. In: 2016 IEEE High Performance Extreme Computing Conference (HPEC-2016), Boston, USA (2016)

  11. Shehab, M.A., Ghadawi, A.A., Alawneh, L., Al-Ayyoub, M., Jararweh, Y.: A hybrid CPU-GPU implementation to accelerate multiple pairwise protein sequence alignment. In: The 8th International Conference on Information and Communication Systems, Irbid (2017)

  12. Shehab, M.A., Al-Ayyoub, M., Jararweh, Y., Jarrah, M.: Accelerating compute-intensive image segmentation algorithms using GPUs. J. Supercomput. 1, 1–23 (2016)

    Google Scholar 

  13. Cook, S., Programming, C.U.D.A.: A Developer’s Guide to Parallel Computing with GPUs. Morgan Kaufmann, San Francisco, CA (2012)

    Google Scholar 

  14. Sintorn, E., Assarsson, U.: Fast parallel GPU-sorting using a hybrid algorithm. J. Parallel Distrib. Comput. 68(10), 1381–1388 (2008)

    Article  MATH  Google Scholar 

  15. Satish, N., Harris, M., Garland, M.: Designing efficient sorting algorithms for many core GPUs. In IEEE International Symposium on Parallel & Distributed Processing, 2009 (IPDPS 2009), pp. 1–10. IEEE (2009)

  16. Neelima, B., Shamsundar, B.B., Narayan, A., Prabhu, R., Gomes, C.: Kepler GPU accelerated recursive sorting using dynamic parallelism. Concurr. Comput. Pract. Exp. 29(4) (2017). https://doi.org/10.1002/cpe.3865

  17. Ye, Y., Du, Z., Bader, D.A., Yang, Q., Huo, W.: GPUMemSort: a high performance graphics co-processors sorting algorithm for large scale in-memory data. GSTF J. Comput. 1(2), 23–27 (2018)

    Google Scholar 

  18. Singh, D., Reddy, C.K.: A survey on platforms for big data analytics. J. Big Data 2(1), 8 (2015)

    Article  Google Scholar 

  19. Jiang, H., Chen, Y., Qiao, Z., Weng, T.H., Li, K.C.: Scaling up MapReduce-based big data processing on multi-GPU systems. Clust. Comput. 18(1), 369–383 (2015)

    Article  Google Scholar 

  20. O’Driscoll, A., Daugelaite, J., Sleator, R.D.: ‘Big data’, Hadoop and cloud computing in genomics. J. Biomed. Inform. 46(5), 774–781 (2013)

    Article  MATH  Google Scholar 

  21. Islam, M.R., Uddin, S.M.R., Roy, C.: Computational complexities of the external sorting algorithm with no additional disk space. Int. J. Comput. Internet Manag. (IJCIM) 13(3), 60–68 (2005)

    Google Scholar 

  22. Islam, M.R., Nusrat, W., Hossain, M., Rana, S.M.M.: A new external sorting algorithm with no additional disk space with special in-place merging technique. In: International Conference on Computer and Information Technology (ICCIT), 26–28 December 2004; Dhaka, Bangladesh (2004)

  23. Islam, M.R., Adnan, N., Islam, N., Hossen, S.: A new external sorting algorithm with no additional disk space. Inf. Process. Lett. 86, 229–233 (2003)

    Article  MathSciNet  MATH  Google Scholar 

  24. Agarwal, A., Vitter, J.: The input/output complexity of sorting and related problems. Commun. ACM 31(8), 1116–1127 (1988)

    Article  MathSciNet  Google Scholar 

  25. Dufrene, W.R., Lin, F.C.: An efficient sorting algorithm with no additional space. Comput. J. 35(3), 308–310 (1992)

    Article  Google Scholar 

  26. Betty, S.: Merging sorted runs using large main memory. Acta Inf. 27(3), 195–215 (1989)

    Article  MathSciNet  MATH  Google Scholar 

  27. Zheng, L., Larson, P.-Å.: Speeding up external mergesort. IEEE Trans. Knowl. Data Eng. 8(2), 322–332 (1996)

    Article  Google Scholar 

  28. Zheng, L., Larson, P.-Å.: Buffering and read-ahead strategies for external merge sort. In: Proceedings of the International Conference on Very Large Databases, vol. 24, pp. 523–533 (1998)

  29. Yiannis, J., Zobel, J.: Compression techniques for fast external sorting. VLDB J. 16(2), 269–291 (2007)

    Article  Google Scholar 

  30. Verkamo, A.I.: Performance comparison of distributive and merge sort as external sorting algorithms. J. Syst. Softw. 10(3), 187–200 (1989)

    Article  Google Scholar 

  31. Nodine, M.H., Vitter, J.S.: Deterministic distribution sort in shared and distributed memory multiprocessors. In: Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures 1993, Velen, June 30–July 02. Germany (1993)

  32. Cunto, W., Gonnet, G.H., Munro, J.I., Poblete, P.V.: Fringe analysis for extquick: an in situ distributive external sorting algorithm. Inf. Comput. 92(2), 141–160 (1991)

    Article  MathSciNet  MATH  Google Scholar 

  33. Verkamo, A.I.: External Quicksort. Performance Evaluation 8(4), 271–288 (1988)

    Article  MathSciNet  MATH  Google Scholar 

  34. Wegner, L.M., Teuhola, J.I.: The external heapsort. IEEE Trans. Softw. Eng. 15(7), 917–925 (1989)

    Article  Google Scholar 

  35. Samet, H.: Applications of Spatial Data Structures: Computer Graphics, Image Processing, and GIS. Addison Wesley, Reading, MA (1990)

    Google Scholar 

  36. Arge, L., Vengroff, D.E., Vitter, J.S.: External-memory algorithms for processing line segments in geographic information systems. Algorithmica 47(1), 1–25 (2007)

    Article  MathSciNet  MATH  Google Scholar 

  37. Lars, A.: External-memory algorithms with applications in GIS. In: van Kreveld, M., Nievergelt, J., Roos, T., Widmayer, P. (eds.) Algorithmic Foundations of Geographic Information Systems, pp. 213–254. Springer, Berlin (1996) (this book originated from the CISM Advanced School on the Algorithmic Foundations of Geographic Information Systems)

  38. Won, K.: Introduction to Object-Oriented Databases. MIT Press, Cambridge (1990)

    Google Scholar 

  39. Funkhouser, T.A., Sequin, C.H., Teller, S.J.: Management of large amounts of data in interactive building walkthroughs. In: Proceedings of the 1992 Symposium on Interactive 3D Graphics, Cambridge, MA, I3D ‘92, pp. 11–20. ACM, New York (1992)

  40. NASA.: NASA’s Earth Observing System (EOS) web page, NASA Goddard Space Flight Center, http://eospso.gsfc.nasa.gov/

  41. TerraServer-USA.: Microsoft’s Online Database of Satellite Images. http://terraserver.microsoft.com/

  42. Google Earth Online Database of Satellite Images. http://earth.google.com/

  43. Paul, W.: Data Ware Housing. Elsevier, Amsterdam (2000)

    Google Scholar 

  44. Matsumoto, K., Nakasato, N., Sedukhin, S.G.: Blocked all-pairs shortest paths algorithm for hybrid CPU-GPU system. In: 2011 IEEE 13th International Conference on High Performance Computing and Communications (HPCC), pp. 145–152. IEEE (2011)

  45. Souza, D.S., Santos, H.G., Coelho, I.M., Araujo, J.A.: A hybrid CPU-GPU scatter search for large-sized generalized assignment problems. In: International Conference on Computational Science and Its Applications, pp. 133–147. Springer, Cham (2017)

  46. Shehab, M.A., Ghadawi, A.A., Alawneh, L., Al-Ayyoub, M., Jararweh, Y.: A hybrid CPU-GPU implementation to accelerate multiple pairwise protein sequence alignment. In: 2017 8th International Conference on Information and Communication Systems (ICICS), pp. 12–17. IEEE (2017)

  47. Nvidia.: “Nvidia Kepler GK110, Next-Generation Cuda Compute Architecture. Nvidia (2017)

  48. Alandoli, M., Al-Ayyoub, M., Al-Smadi, M., Jararweh, Y., Benkhelifa, E.: Using dynamic parallelism to speed up clustering-based community detection in social networks. In: IEEE International Conference on Future Internet of Things and Cloud Workshops (FiCloudW), pp. 240–245. IEEE (2016)

  49. Sorting Benchmark http://sortbenchmark.org/

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ali Shatnawi.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Shatnawi, A., AlZahouri, Y., Shehab, M.A. et al. Toward a new approach for sorting extremely large data files in the big data era. Cluster Comput 22, 819–828 (2019). https://doi.org/10.1007/s10586-018-2860-1

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10586-018-2860-1

Keywords

Navigation