Abstract
Accessing pixels in memory is a well-known bottleneck of SIMD (single instruction, multiple data) processors in video and imaging applications. To tackle it, we propose new block and row access modes for a parallel on-chip memory subsystem, which enable higher processing throughput and lower energy consumption than the access modes of state-of-the-art subsystems. The new access modes significantly reduce the number of on-chip memory accesses and thereby accelerate one of the key video/imaging kernels: sub-pixel block-matching motion estimation. The main idea is to exploit the spatial overlaps of blocks/rows accessed for pixel interpolation, which are known at subsystem design time, and to merge multiple accesses into a single one by accessing somewhat more pixels at a time than other parallel memories do. To avoid the need for a wider, and therefore more costly, SIMD datapath, we propose new memory read operations that split all pixels accessed at a time into multiple SIMD-wide blocks/rows, in a form convenient for further processing. As a proof of concept, we describe a parametric, scalable, and cost-efficient architecture that supports the new access modes. The architecture is based on a previously proposed set of memory banks with multiple pixels per bank word, and a previously proposed shifted scheme for arranging pixels in the banks. We demonstrate the advantages of this work analytically and experimentally in a case study of sub-pixel motion estimation for video frame-rate conversion. The implemented motion estimator processes 2160p video at 60 fps in real time while clocked at 600 MHz. Compared to implementations based on state-of-the-art subsystems, this work enables 40–70% higher throughput, consumes 17–44% less energy, and has similar silicon area and off-chip memory bandwidth costs. That is 1.8–2.9 times more efficient than the prior art, considering the throughput and all costs, i.e., energy consumption, area, and off-chip bandwidth. This higher efficiency results from the new access modes, which reduce the number of on-chip memory accesses by a factor of 1.6–2.1, and from the cost-efficient architecture.
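The merging idea can be illustrated with a small software model. The sketch below is not the architecture described in the article; it assumes a simple bilinear half-pixel filter and uses hypothetical helper names (`read_block`, `half_pel_separate`, `half_pel_merged`). It counts modelled memory accesses when four overlapping SIMD-wide blocks are read separately, and when one slightly larger block is read once and then split into the same four SIMD-wide blocks.

```python
def read_block(frame, y, x, h, w, stats):
    """Model one on-chip memory access that returns an h x w pixel block."""
    stats["accesses"] += 1
    return [row[x:x + w] for row in frame[y:y + h]]


def average4(a, b, c, d):
    """Rounded average of four equally sized blocks (assumed bilinear filter)."""
    return [[(pa + pb + pc + pd + 2) // 4
             for pa, pb, pc, pd in zip(ra, rb, rc, rd)]
            for ra, rb, rc, rd in zip(a, b, c, d)]


def half_pel_separate(frame, y, x, h, w, stats):
    """Baseline: four overlapping h x w reads, one per interpolation operand."""
    a = read_block(frame, y, x, h, w, stats)
    b = read_block(frame, y, x + 1, h, w, stats)
    c = read_block(frame, y + 1, x, h, w, stats)
    d = read_block(frame, y + 1, x + 1, h, w, stats)
    return average4(a, b, c, d)


def half_pel_merged(frame, y, x, h, w, stats):
    """Merged: one (h+1) x (w+1) read, split into four h x w blocks."""
    wide = read_block(frame, y, x, h + 1, w + 1, stats)
    a = [row[:w] for row in wide[:h]]   # top-left block
    b = [row[1:] for row in wide[:h]]   # shifted right by one pixel
    c = [row[:w] for row in wide[1:]]   # shifted down by one pixel
    d = [row[1:] for row in wide[1:]]   # shifted right and down
    return average4(a, b, c, d)


# Synthetic 16 x 16 frame; the block size 4 x 8 is an arbitrary choice.
frame = [[(7 * r + 3 * c) % 256 for c in range(16)] for r in range(16)]
sep_stats = {"accesses": 0}
mrg_stats = {"accesses": 0}
sep = half_pel_separate(frame, 2, 3, 4, 8, sep_stats)
mrg = half_pel_merged(frame, 2, 3, 4, 8, mrg_stats)
print(sep == mrg, sep_stats["accesses"], mrg_stats["accesses"])
```

In this toy model both variants produce the same interpolated block, while the merged variant issues a single access that fetches only one extra row and one extra column. The 4-to-1 ratio is a property of the simplified filter assumed here and should not be read as the 1.6–2.1 times reduction reported for the actual motion estimator.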
Cite this article
Jakovljević, R., Berić, A., van Dalen, E. et al. New access modes of parallel memory subsystem for sub-pixel motion estimation. J Real-Time Image Proc 15, 279–296 (2018). https://doi.org/10.1007/s11554-014-0481-3
DOI: https://doi.org/10.1007/s11554-014-0481-3