Abstract
We present a systematic methodology to support the design tradeoffs of array processors with respect to several emerging issues: (1) high performance and high flexibility, (2) low cost and low power, (3) efficient memory usage, and (4) system-on-a-chip design, or ease of system integration. The methodology is algebraic, so it can cope with high-dimensional data dependence. It consists of a set of transformation rules on data dependence graphs (DGs) that facilitate flexible array designs. For example, the two common partitioning approaches, LPGS (locally parallel, globally sequential) and LSGP (locally sequential, globally parallel), can be unified under the methodology. The methodology supports the design of high-speed, massively parallel processor arrays with efficient memory usage. More specifically, it leads to a novel systolic cache architecture consisting of shift registers only (a cache without tags). To demonstrate how the methodology works, we present several systolic design examples based on the block-matching motion estimation algorithm (BMA). By multiprojecting a 4D DG of the BMA onto a 2D mesh, we can reconstruct several existing array processors. By multiprojecting a 6D DG of the BMA, we derive a novel 2D systolic array with significantly improved data reusability (96%) and processor utilization (99%).
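The 4D DG mentioned in the abstract can be read as the index space (i, j, u, v) of the full-search BMA loop nest, where (u, v) ranges over candidate displacements in [-p, p]² and (i, j) over the pixels of an N × N block. The Python sketch below is a minimal illustration under stated assumptions, not the authors' method. It shows that loop nest together with one linear space-time projection of the DG onto a (2p+1) × (2p+1) mesh, and it checks the two usual conditions for such a mapping: no two nodes occupy the same processor at the same time step, and every dependence vector d satisfies sᵀd > 0 for the schedule vector s. The allocation matrix, the schedule `(N, 1, 1, 1)`, and the localized dependence vectors in `DEPS` are hypothetical choices made for illustration, not the projections derived in the paper. The hypothetical `partition` helper sketches how LSGP (contiguous blocks per physical PE) and LPGS (cyclic assignment) fold one axis of the virtual mesh onto a smaller physical array.

```python
import itertools

# Hypothetical localized dependence vectors over (i, j, u, v): SAD accumulation
# along i and j, and propagation of current-block pixels x(i, j) along u and v.
# Illustrative assumptions only, not the DG edges given in the paper.
DEPS = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]


def full_search_bma(cur, ref, p):
    """Return (u, v, sad) for the displacement in [-p, p]^2 with minimum SAD.

    cur is an N x N block; ref is the (N + 2p) x (N + 2p) search window, so
    ref[p + i + u][p + j + v] is the pixel compared with cur[i][j].
    The loops span the 4D index space (i, j, u, v) of the DG.
    """
    n = len(cur)
    best = None
    for u in range(-p, p + 1):
        for v in range(-p, p + 1):
            sad = sum(abs(cur[i][j] - ref[p + i + u][p + j + v])
                      for i in range(n) for j in range(n))
            if best is None or sad < best[2]:  # keep the first minimum found
                best = (u, v, sad)
    return best


def project(n, p, alloc, sched):
    """Linear space-time mapping: DG node -> (processor coordinates, time step)."""
    mapping = {}
    for node in itertools.product(range(n), range(n),
                                  range(-p, p + 1), range(-p, p + 1)):
        pe = tuple(sum(a * x for a, x in zip(row, node)) for row in alloc)
        t = sum(s * x for s, x in zip(sched, node))
        mapping[node] = (pe, t)
    return mapping


def is_conflict_free(mapping):
    """True if no two DG nodes share the same processor at the same time step."""
    slots = list(mapping.values())
    return len(slots) == len(set(slots))


def is_causal(sched, deps):
    """True if every dependence vector d satisfies sched . d > 0."""
    return all(sum(s * d for s, d in zip(sched, dep)) > 0 for dep in deps)


def run_on_mesh(cur, ref, p, mapping):
    """Execute nodes in time order; each processor sums the SAD terms mapped to it."""
    acc = {}
    for (i, j, u, v), (pe, _t) in sorted(mapping.items(), key=lambda kv: kv[1][1]):
        acc[pe] = acc.get(pe, 0) + abs(cur[i][j] - ref[p + i + u][p + j + v])
    return acc


def partition(coord, virtual_size, physical_size, scheme):
    """Fold one 0-based virtual-PE coordinate onto a smaller physical array."""
    if scheme == "LSGP":  # locally sequential, globally parallel: contiguous blocks
        block = -(-virtual_size // physical_size)  # ceiling division
        return coord // block
    if scheme == "LPGS":  # locally parallel, globally sequential: cyclic assignment
        return coord % physical_size
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    n, p = 4, 2
    alloc = ((0, 0, 1, 0), (0, 0, 0, 1))  # hypothetical: PE coordinates = (u, v)
    sched = (n, 1, 1, 1)                  # hypothetical schedule vector
    m = project(n, p, alloc, sched)
    print(is_conflict_free(m), is_causal(sched, DEPS))
```

With this allocation each virtual PE owns one candidate displacement (u, v), and the term N·i + j in the schedule serializes that PE's N² nodes. Dropping the i component of the schedule, as in `(0, 1, 1, 1)`, makes nodes that differ only in i collide on the same PE and time step.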
Cite this article
Chen, YK., Kung, S. A Systolic Design Methodology with Application to Full-Search Block-Matching Architectures. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 19, 51–77 (1998). https://doi.org/10.1023/A:1008012332212