Abstract
In modern multimedia applications, the memory bottleneck can be alleviated with special stride data accesses. The data elements of a stride access can be retrieved in parallel from a parallel memory, where several memory modules work in parallel to increase memory bandwidth and feed the processor with only the data it needs. Previous research has described arbitrary stride access with interleaved memories, in which the skewing scheme is changed at run time according to the stride currently in use. This paper presents improved schemes that are adapted to parallel memories. The proposed parallel memory implementation allows conflict-free accesses with all constant strides, which has not been possible in prior application-specific parallel memories. Moreover, the access location is unrestricted, and the number of data elements accessed equals the number of memory modules. Timing and area estimates are given for an Altera Stratix FPGA and a 0.18 micrometer CMOS process with 2 to 32 memory modules. The FPGA results show a 129 MHz clock frequency for a system with 16 memory modules when the read and write latencies are 3 and 2 clock cycles, respectively. The complexity of the proposed system is shown to lie between that of application-specific and highly configurable parallel memory systems.
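To make the notion of a conflict-free stride access concrete, the following minimal Python sketch compares plain low-order interleaving with a simple stride-dependent module mapping. It is an illustration only and is not the scheme proposed in the paper: the function names are hypothetical, the number of modules is assumed to be a power of two, and only the module index is computed (a complete scheme would also need a row address within each module).

```python
def modules_low_order(base, stride, n_modules):
    """Module index of each accessed element under low-order interleaving
    (module = address mod n_modules)."""
    return [(base + i * stride) % n_modules for i in range(n_modules)]


def modules_stride_dependent(base, stride, n_modules):
    """Illustrative stride-dependent mapping (an assumption, not the paper's
    scheme): write stride = sigma * 2**x with sigma odd, discard the x low
    address bits, then take the result modulo n_modules.
    Assumes n_modules is a power of two and stride is positive."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    if n_modules <= 0 or n_modules & (n_modules - 1):
        raise ValueError("n_modules must be a power of two")
    x = (stride & -stride).bit_length() - 1  # trailing zero bits of stride
    return [((base + i * stride) >> x) % n_modules for i in range(n_modules)]


def is_conflict_free(modules):
    """An access is conflict-free when every element maps to a distinct module."""
    return len(set(modules)) == len(modules)


if __name__ == "__main__":
    n = 8
    for stride in (1, 3, 4, 6):
        fixed = is_conflict_free(modules_low_order(5, stride, n))
        adapted = is_conflict_free(modules_stride_dependent(5, stride, n))
        print(f"stride {stride}: low-order {fixed}, stride-dependent {adapted}")
```

With the fixed mapping, an access of as many elements as there are modules is conflict-free only when the stride and the module count share no common factor, which is why even strides collide when the module count is a power of two. Selecting the mapping according to the current stride removes those collisions in this sketch for any start address.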
Cite this article
Aho, E., Vanne, J. & Hämäläinen, T.D. Configurable Data Memory for Multimedia Processing. J Sign Process Syst Sign Image 50, 231–249 (2008). https://doi.org/10.1007/s11265-007-0126-x
DOI: https://doi.org/10.1007/s11265-007-0126-x