New access modes of parallel memory subsystem for sub-pixel motion estimation

  • Original Research Paper
Journal of Real-Time Image Processing

Abstract

Accessing pixels in memory is a well-known bottleneck of SIMD (single instruction, multiple data) processors in video and imaging applications. To tackle it, we propose new block and row access modes of a parallel on-chip memory subsystem, which enable higher processing throughput and lower energy consumption than the access modes of state-of-the-art subsystems. The new access modes significantly reduce the number of on-chip memory accesses and thereby accelerate one of the key video/imaging kernels: sub-pixel block-matching motion estimation. The main idea is to exploit the spatial overlaps of blocks/rows accessed for pixel interpolation, which are known at subsystem design time, and to merge multiple accesses into a single one by accessing somewhat more pixels at a time than other parallel memories do. To avoid the need for a wider, and therefore more costly, SIMD datapath, we propose new memory read operations that split all pixels accessed at a time into multiple SIMD-wide blocks/rows, in a form convenient for further processing. As a proof of concept, we describe a parametric, scalable, and cost-efficient architecture that supports the new access modes. The architecture is based on a previously proposed set of memory banks with multiple pixels per bank word, and on a previously proposed shifted scheme for arranging pixels in the banks. We demonstrate the advantages of this work analytically and experimentally in a case study of sub-pixel motion estimation for video frame-rate conversion. The implemented motion estimator processes 2160p video at 60 fps in real time when clocked at 600 MHz. Compared to implementations based on state-of-the-art subsystems, this work enables 40–70% higher throughput, consumes 17–44% less energy, and has similar silicon area and off-chip memory bandwidth costs. Taking throughput and all costs (energy consumption, area, and off-chip bandwidth) into account, this work is 1.8–2.9 times more efficient than the prior art. This higher efficiency results from the new access modes, which reduce the number of on-chip memory accesses by a factor of 1.6–2.1, and from the cost-efficient architecture.
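The access-merging idea in the abstract can be illustrated with a small software model. The sketch below is not the authors' architecture: it assumes a four-tap bilinear half-pixel filter, models a memory access as a Python list slice, and uses hypothetical helper names (`read_block`, `half_pel_separate`, `half_pel_merged`). It only shows that four overlapping w x h block reads can be replaced by one (w+1) x (h+1) read whose pixels are then split into four SIMD-wide views, with identical interpolation results. The 4-to-1 access count is specific to this toy filter and is not the 1.6–2.1 factor reported above.

```python
def read_block(frame, x, y, w, h):
    """Model one memory access that returns a w x h block at column x, row y."""
    return [row[x:x + w] for row in frame[y:y + h]]


def average4(taps, w, h):
    """Rounded average of four w x h blocks (bilinear half-pixel sample)."""
    return [[(taps[0][r][c] + taps[1][r][c] + taps[2][r][c] + taps[3][r][c] + 2) >> 2
             for c in range(w)]
            for r in range(h)]


def half_pel_separate(frame, x, y, w, h):
    """Baseline (assumed): one w x h access per filter tap, four in total."""
    taps = [read_block(frame, x + dx, y + dy, w, h)
            for dy in (0, 1) for dx in (0, 1)]
    return average4(taps, w, h), len(taps)


def half_pel_merged(frame, x, y, w, h):
    """Merged access: a single (w+1) x (h+1) read covers all four taps."""
    big = read_block(frame, x, y, w + 1, h + 1)
    # Split the larger block into four w-wide views, so the datapath
    # never has to process more than w pixels per row at a time.
    taps = [[row[dx:dx + w] for row in big[dy:dy + h]]
            for dy in (0, 1) for dx in (0, 1)]
    return average4(taps, w, h), 1


# Synthetic 16 x 16 frame with 8-bit pixel values.
frame = [[(7 * r + 13 * c) % 256 for c in range(16)] for r in range(16)]
sep, n_sep = half_pel_separate(frame, 3, 5, 8, 4)
mer, n_mer = half_pel_merged(frame, 3, 5, 8, 4)
```

In this model, `sep` and `mer` hold the same interpolated block, while `n_sep` and `n_mer` count the modeled accesses (four versus one). The merged read fetches 9 x 5 pixels instead of 4 x (8 x 4), which mirrors the trade-off described above: somewhat more pixels per access in exchange for fewer accesses.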

Corresponding author

Correspondence to Radomir Jakovljević.

Cite this article

Jakovljević, R., Berić, A., van Dalen, E. et al. New access modes of parallel memory subsystem for sub-pixel motion estimation. J Real-Time Image Proc 15, 279–296 (2018). https://doi.org/10.1007/s11554-014-0481-3

  • DOI: https://doi.org/10.1007/s11554-014-0481-3
