Abstract
Advances in computer hardware and networking technologies at an unprecedented rate have made many-core computing affordable and readily available within a few years. Nonetheless, building scalable parallel software remains a challenge for programmers. Optimizing parallel programs for a many-core platform is a multifaceted problem in which both system and architectural factors must be taken into account. In this paper, we tackle this problem by implementing parallel programs with the different available programming paradigms and evaluating application behavior on the TILE64 many-core platform. Specifically, we investigate a hybrid producer-write plus consumer-read shared memory programming paradigm for implementing a master–worker video decoder and encoder on this platform. Experimental results show that the proposed implementation achieves competitive speedup, scales well with the number of available cores, and delivers up to a fourfold performance improvement over other implementations when decoding a sample 1080p video.

Cite this article
Lin, XY., Lai, KC., Li, KC. et al. Efficient programming paradigm for video streaming processing on TILE64 platform. J Supercomput 65, 823–847 (2013). https://doi.org/10.1007/s11227-012-0867-6
DOI: https://doi.org/10.1007/s11227-012-0867-6