Abstract
The rapid advance of computer hardware and the popularity of multimedia applications have made multi-core processors with sub-word parallelism instructions a dominant market trend in desktop PCs as well as high-end mobile devices. This paper presents an efficient parallel implementation of the 2D convolution algorithm, which demands high-performance computing power, on multi-core desktop PCs. 2D convolution is a representative computation-intensive algorithm in image and signal processing applications and is accompanied by heavy memory access, although its computational complexity is relatively low. The purpose of this study is to explore the effectiveness of exploiting the Streaming SIMD (Single Instruction Multiple Data) Extensions (SSE) technology and the TBB (Threading Building Blocks) run-time library on Intel multi-core processors. By doing so, we can take advantage of all the hardware features of a multi-core processor concurrently for data- and task-level parallelism. For the performance evaluation, we implemented a 3 × 3 kernel-based convolution algorithm using SSE2 and TBB in different combinations and compared their processing speeds. The experimental results show that both technologies have a significant effect on performance and that the processing speed can be greatly improved when the two technologies are used at the same time; for example, speedups of 6.2, 6.1, and 1.4 times compared with the implementation using either of them are reported for 256 × 256, 512 × 512, and 1024 × 1024 data sets, respectively.
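To make the data-level parallelism described above concrete, the following is a minimal sketch of a 3 × 3 kernel applied with SSE2 integer intrinsics, next to a scalar reference. It is not the implementation evaluated in the paper: the function names (`conv3x3_scalar`, `conv3x3_sse2`, `pixel_at`), the 16-bit pixel type, the row-major layout, and the choice to leave border pixels untouched are all assumptions made for illustration. The kernel is applied without flipping, which is equivalent to convolution for symmetric kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics
#define HAVE_SSE2 1
#endif

// Hypothetical helper: weighted 3x3 sum around interior pixel (x, y).
// The result is truncated to 16 bits, matching the wrap-around of the SIMD path.
static int16_t pixel_at(const std::vector<int16_t>& src, int w, int x, int y,
                        const int16_t k[9]) {
    int sum = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += k[(dy + 1) * 3 + (dx + 1)] * src[(y + dy) * w + (x + dx)];
    return static_cast<int16_t>(sum);
}

// Scalar reference: interior pixels only, border pixels of dst are not written.
void conv3x3_scalar(const std::vector<int16_t>& src, std::vector<int16_t>& dst,
                    int w, int h, const int16_t k[9]) {
    assert(src.size() == static_cast<std::size_t>(w) * h && dst.size() == src.size());
    for (int y = 1; y + 1 < h; ++y)
        for (int x = 1; x + 1 < w; ++x)
            dst[y * w + x] = pixel_at(src, w, x, y, k);
}

// SSE2 variant: computes 8 adjacent 16-bit output pixels per iteration.
// Falls back to the scalar loop when SSE2 is not available at compile time.
void conv3x3_sse2(const std::vector<int16_t>& src, std::vector<int16_t>& dst,
                  int w, int h, const int16_t k[9]) {
    assert(src.size() == static_cast<std::size_t>(w) * h && dst.size() == src.size());
    for (int y = 1; y + 1 < h; ++y) {  // rows are independent of each other
        int x = 1;
#ifdef HAVE_SSE2
        // Pixels x..x+7 must all be interior, i.e. x + 7 <= w - 2.
        for (; x + 8 <= w - 1; x += 8) {
            __m128i acc = _mm_setzero_si128();
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    const int16_t* p = &src[(y + dy) * w + (x + dx)];
                    // Unaligned load of 8 neighbours shifted by (dx, dy).
                    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
                    // Broadcast one kernel coefficient to all 8 lanes.
                    __m128i c = _mm_set1_epi16(k[(dy + 1) * 3 + (dx + 1)]);
                    // Low 16 bits of each product, accumulated lane-wise.
                    acc = _mm_add_epi16(acc, _mm_mullo_epi16(v, c));
                }
            _mm_storeu_si128(reinterpret_cast<__m128i*>(&dst[y * w + x]), acc);
        }
#endif
        for (; x + 1 < w; ++x)  // scalar tail for the remaining interior pixels
            dst[y * w + x] = pixel_at(src, w, x, y, k);
    }
}
```

Because each output row depends only on three input rows and writes to its own slice of `dst`, the outer loop over `y` is the natural place to add task-level parallelism, for example by splitting the row range across threads with a construct such as `tbb::parallel_for`. That combination is not shown here, so the sketch covers only the SIMD half of the approach.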










Acknowledgements
Funding for this paper was provided by Namseoul University.
Cite this article
Kim, C.G., Kim, J.G. & Lee, D.H. Optimizing image processing on multi-core CPUs with Intel parallel programming technologies. Multimed Tools Appl 68, 237–251 (2014). https://doi.org/10.1007/s11042-011-0906-y
DOI: https://doi.org/10.1007/s11042-011-0906-y