New access modes of parallel memory subsystem for sub-pixel motion estimation

Jakovljević, Radomir; Berić, Aleksandar; van Dalen, Edwin; Milićev, Dragan

doi:10.1007/s11554-014-0481-3

Original Research Paper
Published: 30 December 2014

Journal of Real-Time Image Processing, volume 15, pages 279–296 (2018)

Radomir Jakovljević¹,²,
Aleksandar Berić²,
Edwin van Dalen² &
…
Dragan Milićev¹

Abstract

Accessing pixels in memory is a well-known bottleneck of SIMD (single instruction, multiple data) processors in video and imaging applications. To tackle it, we propose new block and row access modes for a parallel on-chip memory subsystem, which enable higher processing throughput and lower energy consumption than the access modes of state-of-the-art subsystems. The new access modes significantly reduce the number of on-chip memory accesses and thereby accelerate one of the key video/imaging kernels: sub-pixel block-matching motion estimation. The main idea is to exploit the spatial overlaps of the blocks/rows accessed for pixel interpolation, which are known at subsystem design time, and to merge multiple accesses into a single one by accessing somewhat more pixels at a time than other parallel memories do. To avoid the need for a wider, and therefore more costly, SIMD datapath, we propose new memory read operations that split all pixels accessed at a time into multiple SIMD-wide blocks/rows, in a form convenient for further processing. As a proof of concept, we describe a parametric, scalable, and cost-efficient architecture that supports the new access modes. The architecture is based on a previously proposed set of memory banks with multiple pixels per bank word, and on a previously proposed shifted scheme for arranging pixels in the banks. We demonstrate the advantages of this work analytically and experimentally in a case study of sub-pixel motion estimation for video frame-rate conversion. The implemented motion estimator processes 2160p video at 60 fps in real time when clocked at 600 MHz. Compared to implementations based on state-of-the-art subsystems, this work enables 40–70% higher throughput, consumes 17–44% less energy, and has similar silicon area and off-chip memory bandwidth costs. This makes it 1.8–2.9 times more efficient than the prior art when the throughput and all costs (energy consumption, area, and off-chip bandwidth) are considered together. The higher efficiency results from the new access modes, which reduce the number of on-chip memory accesses by a factor of 1.6–2.1, and from the cost-efficient architecture.
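The main idea of the abstract, merging overlapping accesses and then splitting the wider result into SIMD-wide blocks, can be illustrated with a minimal software sketch. This is not the architecture described in the article: the 2x2 bilinear half-pixel filter, the 8x4 block size, and all function names below are assumptions chosen for illustration, and a memory access is modelled simply as a crop of a 2-D list.

```python
# Toy model only: names, block size, and the 2x2 bilinear filter are assumptions,
# not taken from the article.

# (dx, dy) offsets of the four overlapping blocks needed for a half-pixel position.
TAPS = [(0, 0), (1, 0), (0, 1), (1, 1)]

def crop(pixels, x, y, w, h):
    """Return the w x h sub-block of a 2-D list whose top-left corner is (x, y)."""
    return [row[x:x + w] for row in pixels[y:y + h]]

def separate_reads(frame, x, y, w, h):
    """Baseline: one SIMD-wide w x h memory access per overlapping block."""
    blocks = [crop(frame, x + dx, y + dy, w, h) for dx, dy in TAPS]
    return blocks, len(TAPS)  # four modelled memory accesses

def merged_read(frame, x, y, w, h):
    """Merged mode: one (w+1) x (h+1) access, then split into w x h blocks."""
    wide = crop(frame, x, y, w + 1, h + 1)  # the only modelled memory access
    # Splitting works on the data already read, so it costs no further access.
    blocks = [crop(wide, dx, dy, w, h) for dx, dy in TAPS]
    return blocks, 1

def interpolate(blocks):
    """Rounded average of four equally sized blocks (bilinear half-pixel sample)."""
    a, b, c, d = blocks
    return [[(a[j][i] + b[j][i] + c[j][i] + d[j][i] + 2) // 4
             for i in range(len(a[0]))]
            for j in range(len(a))]

def demo():
    """Compare both strategies on a synthetic 32 x 32 frame."""
    frame = [[(7 * x + 13 * y) % 256 for x in range(32)] for y in range(32)]
    sep, n_sep = separate_reads(frame, 5, 3, 8, 4)
    mer, n_mer = merged_read(frame, 5, 3, 8, 4)
    return sep == mer, n_sep, n_mer, interpolate(sep) == interpolate(mer)
```

In this toy case the merged mode reads 9 x 5 = 45 pixels once instead of 8 x 4 = 32 pixels four times, while the datapath still only ever processes 8 x 4 blocks. The 4-to-1 access ratio is a property of this simplified model and should not be read as the reduction reported in the article.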

Author information

Authors and Affiliations

School of Electrical Engineering, University of Belgrade, Belgrade, Serbia
Radomir Jakovljević & Dragan Milićev
Intel Corporation, Santa Clara, USA
Radomir Jakovljević, Aleksandar Berić & Edwin van Dalen

Corresponding author

Correspondence to Radomir Jakovljević.

About this article

Cite this article

Jakovljević, R., Berić, A., van Dalen, E. et al. New access modes of parallel memory subsystem for sub-pixel motion estimation. J Real-Time Image Proc 15, 279–296 (2018). https://doi.org/10.1007/s11554-014-0481-3

Received: 06 September 2014
Accepted: 05 December 2014
Published: 30 December 2014
Issue Date: August 2018
DOI: https://doi.org/10.1007/s11554-014-0481-3
