ABSTRACT
The high arithmetic rates of media processing applications require architectures with tens to hundreds of functional units, multiple register files, and explicit interconnect between functional units and register files. Communication scheduling enables scheduling to these emerging architectures, including those that use shared buses and register file ports. Scheduling to such shared-interconnect architectures is difficult because the scheduler must simultaneously allocate functional units to operations and allocate buses and register file ports to the communications between operations. Prior VLIW scheduling algorithms are limited to clustered register file architectures with no shared buses or register file ports. Communication scheduling extends the range of target architectures by making each communication explicit and decomposing it into three components: a write stub, zero or more copy operations, and a read stub. With communication scheduling, media processing kernels on a distributed register file architecture achieve 98% of the performance of a central register file architecture with only 9% of the area, 6% of the power consumption, and 37% of the access delay. Compared with a clustered register file architecture, they achieve 120% of the performance with 56% of the area and 50% of the power consumption.
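The decomposition named in the abstract (a write stub, zero or more copy operations, and a read stub) can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not the paper's algorithm: the class names `Stub`, `Communication`, and `ReservationTable`, the resource names such as `bus0` and `rf1.w0`, and the idea of modeling each component as a set of (cycle, resource) reservations are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set, Tuple

# Hypothetical model: a slot is one shared resource held for one cycle.
Slot = Tuple[int, str]  # e.g. (0, "bus0") or (1, "rf1.r0")


@dataclass(frozen=True)
class Stub:
    """Shared resources (buses, register file ports) one component occupies."""
    slots: FrozenSet[Slot]


@dataclass
class Communication:
    """One explicit communication, split into the three components."""
    write_stub: Stub                                   # producer side
    read_stub: Stub                                    # consumer side
    copies: List[Stub] = field(default_factory=list)   # zero or more copies

    def slots(self) -> List[Slot]:
        parts = [self.write_stub, *self.copies, self.read_stub]
        return [s for p in parts for s in sorted(p.slots)]


class ReservationTable:
    """Tracks which (cycle, resource) slots are already allocated."""

    def __init__(self) -> None:
        self.taken: Set[Slot] = set()

    def try_reserve(self, comm: Communication) -> bool:
        needed = comm.slots()
        if len(set(needed)) != len(needed):
            return False  # two components of the same communication collide
        if self.taken.intersection(needed):
            return False  # a bus or port is already in use that cycle
        self.taken.update(needed)
        return True


table = ReservationTable()
# Direct communication: the write stub and read stub share one register file.
direct = Communication(
    write_stub=Stub(frozenset({(0, "bus0"), (0, "rf1.w0")})),
    read_stub=Stub(frozenset({(1, "rf1.r0")})),
)
# Communication that needs one copy to move the value between register files.
via_copy = Communication(
    write_stub=Stub(frozenset({(0, "bus1"), (0, "rf0.w0")})),
    copies=[Stub(frozenset({(1, "rf0.r0"), (1, "bus0"), (1, "rf2.w0")}))],
    read_stub=Stub(frozenset({(2, "rf2.r0")})),
)
# Wants bus0 in cycle 0, which `direct` already holds.
clash = Communication(
    write_stub=Stub(frozenset({(0, "bus0"), (0, "rf3.w0")})),
    read_stub=Stub(frozenset({(1, "rf3.r0")})),
)
results = [table.try_reserve(c) for c in (direct, via_copy, clash)]
print(results)  # [True, True, False]
```

The third communication is rejected because it contends for a bus that is already allocated in the same cycle, which is the kind of conflict that makes shared-interconnect scheduling harder than the clustered case. A real scheduler would presumably search over alternative stubs and cycles rather than simply rejecting the request.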
Special Issue: Proceedings of the ninth international conference on Architectural support for programming languages and operating systems (ASPLOS '00)