Abstract
Keeping local caches coherent in shared-memory multiprocessors consumes significant power. The customization methodology we propose exploits the fact that, in embedded systems, designers have substantial knowledge of how memory is shared between tasks. We demonstrate how snoop-induced cache probes can be significantly reduced by deterministically identifying and exploiting the memory regions shared between the processors. Snoop activity is enabled only for accesses that refer to known shared regions. The hardware support is not only cost-efficient but also software-programmable, so it can be reprogrammed and customized across different tasks and applications.
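The filtering idea can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the paper: the class name `SnoopFilter`, its methods, and the addresses used below are all hypothetical. It assumes the shared regions are programmed as address ranges and that an access triggers a snoop only if it falls inside one of them.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class SnoopFilter:
    """Hypothetical model of a software-programmable snoop filter."""

    # Each entry is a half-open address range [start, end) known to be shared.
    regions: List[Tuple[int, int]] = field(default_factory=list)

    def program_region(self, base: int, size: int) -> None:
        """Register a shared region; models a software write to the filter."""
        if size <= 0:
            raise ValueError("size must be positive")
        self.regions.append((base, base + size))

    def clear(self) -> None:
        """Drop all regions, e.g. before loading a different task's layout."""
        self.regions.clear()

    def needs_snoop(self, addr: int) -> bool:
        """Return True only if addr lies inside a known shared region."""
        return any(start <= addr < end for start, end in self.regions)


def count_snoops(filt: SnoopFilter, accesses: Iterable[int]) -> int:
    """Count how many accesses would still trigger snoop activity."""
    return sum(1 for addr in accesses if filt.needs_snoop(addr))


# Assumed example layout: one 4 KiB shared buffer at 0x8000_0000.
demo_filter = SnoopFilter()
demo_filter.program_region(0x8000_0000, 0x1000)

# Five accesses; only the two inside the shared buffer need a snoop.
demo_accesses = [0x1000, 0x8000_0010, 0x8000_0FFF, 0x8000_1000, 0x2000]
demo_snoops = count_snoops(demo_filter, demo_accesses)
```

In this model an unfiltered system would probe remote caches on all five accesses, while the filtered one does so on two. Calling `clear()` and reprogramming the ranges stands in for the reprogrammability across tasks that the abstract describes.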
Index Terms
- Application-aware snoop filtering for low-power cache coherence in embedded multiprocessors