Abstract
Due to the limitations of power consumption and memory capacity, the past few years have seen a strong trend toward heterogeneous environments equipped with accelerators, such as GPUs (Graphics Processing Units), FPGAs (Field-Programmable Gate Arrays), and even MIC (Many Integrated Core) coprocessors, to help traditional SMP (Symmetric Multi-Processing) CPUs speed up applications. In this paper, we choose the Intel MIC architecture coprocessor as the accelerator and design HostoSink, a runtime system for collaborative scheduling of Pthread tasks. HostoSink uses the runtime characteristics of the application and of the heterogeneous environment to schedule Pthread tasks between the CPU and the MIC automatically and dynamically. It thereby gives MIC users an easier way to obtain high performance in a heterogeneous CPU-MIC environment without extensive manual optimization of the original Pthread-based multi-threaded applications. Experimental results show that with HostoSink the overall speedup exceeds 3x relative to the original CPU-only performance, and that the average amount of data transferred between the CPU and the MIC is also reduced.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Liao, X., Xiang, X., Jin, H., Zhang, W., Lu, F. (2014). HostoSink: A Collaborative Scheduling in Heterogeneous Environment. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8630. Springer, Cham. https://doi.org/10.1007/978-3-319-11197-1_17
DOI: https://doi.org/10.1007/978-3-319-11197-1_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11196-4
Online ISBN: 978-3-319-11197-1
eBook Packages: Computer Science, Computer Science (R0)