If you look around, you will find that all computer systems, from portable devices to the most powerful supercomputers, are heterogeneous in nature. The most obvious heterogeneity is the existence of computing nodes of different capabilities (e.g., multicore processors, GPUs, FPGAs). But there are also other heterogeneity factors in computing systems, such as the memory system components and the interconnect. The main reason for these different types of heterogeneity is to achieve good performance with power efficiency.
Heterogeneous computing results in both challenges and opportunities. This book discusses both. It shows that we need to deal with these challenges at all levels of the computing stack: from algorithms all the way to process technology. We discuss the topic of heterogeneous computing from different angles: hardware challenges, current hardware state-of-the-art, software issues, how to make the best use of the current heterogeneous systems, and what lies ahead.
The aim of this book is to introduce the big picture of heterogeneous computing. Whether you are a hardware designer or a software developer, you need to know how the pieces of the puzzle fit together. The main goal is to bring researchers and engineers to the research frontier in the new era that started a few years ago and is expected to continue for decades. We believe that academics, researchers, practitioners, and students will benefit from this book and will be prepared to tackle the big wave of heterogeneous computing that is here to stay.
- A. Abella and A. Gonzalez. June 2006. Heterogeneous way-size cache. In International Conference on Supercomputing (ICS), pp. 239--248. 33Google Scholar
- A. Abualsamid. June 1998. PGP disk's security takes a bite out of crime. Network Computing, 9(10): 54. 47Google Scholar
Digital Library
- O. Aciiçmez. 2007. Yet another microarchitectural attack: Exploiting i-cache. In Proceedings of the 2007 ACM Workshop on Computer Security Architecture, CSAW '07, pp. 11--18. ACM, New York. 49Google Scholar
Digital Library
- O. Aciiçmez, B. B. Brumley, and P. Grabher. 2010. New results on instruction cache attacks. In Proceedings of the 12th International Conference on Cryptographic Hardware and Embedded Systems, CHES '10, pp. 110--124. Springer-Verlag, Berlin, Heidelberg. 49Google Scholar
- S. Aga, S. Jeloka, A. Subramaniyan, S. Narayanasamy, D. Blaauw, and R. Das. February 2017. Compute caches. In 23rd IEEE Symposium on High Performance Computer Architecture (HPCA). 91Google Scholar
- A. Agarwal et al. 1988. An evaluation of directory schemes for cache coherence. In 25 Years ISCA: Retrospectives and Reprints, pp. 353--362. 34Google Scholar
Cross Ref
- A. Agarwal et al. 2004. Evaluating the raw microprocessor: An exposed-wire-delay architecture for ILP and streams. In Proceedings of the 32nd International Symposium on Computer Architecture (ISCA), pp. 2--13. 34Google Scholar
- A. Agarwal and S. D. Pudar. 1993. Column-associative caches: A technique for reducing the miss rate of direct-mapped caches. In Proceedings of the 20th International Symposium on Computer Architecture, pp. 179--190. 33Google Scholar
- H. Al-Zoubi, A. Milenkovic, and M. Milenkovic. 2004. Performance evaluation of cache replacement policies for the SPEC CPU2000 benchmark suite. In Proceedings of the 42nd ACM Southeast Conference, pp. 267--272. 35Google Scholar
- D. H. Albonesi. 2002. Selective cache ways: On-demand cache resource allocation. Journal of Instruction-Level Parallelism, pp. 248--259. 33, 34Google Scholar
- J. Allred, S. Roy, and K. Chakraborty. 2012. Designing for dark silicon: A methodological perspective on energy efficient systems. In Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED '12, pp. 255--260. ACM, New York. 3Google Scholar
- J. Archibald and J.-L. Baer. May 1986. Cache coherence protocols: Evaluation using a multiprocessor simulation model. ACM Transactions on Computer Systems, pp. 273--298. 34Google Scholar
Digital Library
- E. Azarkhish, D. Rossi, I. Loi, and L. Benini. 2016. Design and evaluation of a processing-in-memory architecture for the smart memory cube. In Proceedings of the 29th International Conference on Architecture of Computing Systems - ARCS 2016, Volume 9637, pp. 19--31. Springer-Verlag, New York. 91Google Scholar
- R. Balasubramonian, J. Chang, T. Manning, J. H. Moreno, R. Murphy, R. Nair, and S. Swanson. July 2014. Near-data processing: Insights from a micro-46 workshop. IEE MICRO Magazine, 34(4): 36--42. 91Google Scholar
Cross Ref
- B. Beckmann and D. Wood. December 2004. Managing wire delay in large chip-multiprocessor caches. In Proceedings of the 37th International Annual Symposium on Microarchitecture (Micro-37), pp. 319--330. 34Google Scholar
- L. Benini and G. DeMicheli. January 2002. Networks on chips: A new SoC paradigm. In IEEE Computer, pp. 70--78. 34Google Scholar
- M. T. Billingsley III, B. R. Tibbitts, and A. D. George. 2010. Improving UPC productivity via integrated development tools. In Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, PGAS '10, pp. 8:1--8:9. ACM, New York. 68Google Scholar
- S. Borkar, P. Dubey, K. Kahn, D. Kuck, H. Mulder, S. Pawlowski, and J. Rattner. 2006. Platform 2015: Intel processsor and platform evolution for the next decade. White paper, Intel Corporation. 35Google Scholar
- A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, K. Hsieh, K. T. Malladi, H. Zheng, and O. Mutlu. January 2017. Lazypim: An efficient cache coherence mechanism for processing-in-memory. IEEE Computer Architecture Letters, 16(1): 46--50. 91Google Scholar
Digital Library
- P. Bose. February 2013. Is dark silicon real? Technical perspective. Communications of the ACM, 56(2): 92--92. 3Google Scholar
Digital Library
- J. Boukhobza, S. Rubini, R. Chen, and Z. Shao. November 2017. Emerging NVM: A survey on architectural integration and research challenges. ACM Transactions on Design Automation of Electronic Systems, 23(2): 14:1--14:32. 7Google Scholar
Digital Library
- R. K. Braithwaite, W.-c. Feng, and P. S. McCormick. 2012. Automatic NUMA characterization using cbench. In Proceedings of the 3rd ACM/SPEC International Conference on Performance Engineering, ICPE '12, pp. 295--298. ACM, New York. 6Google Scholar
- S. Bratus, N. D'Cunha, E. Sparks, and S. W. Smith. 2008. Toctou, traps, and trusted computing. In Proceedings of the 1st International Conference on Trusted Computing and Trust in Information Technologies: Trusted Computing---Challenges and Applications, pp. 14--32. Springer-Verlag, Berlin, Heidelberg. 47Google Scholar
- Broadcom Corporation, 2006. BCM1455: Quad-core 64-bit MIPS processor. http://www.broadcom.com/collateral/pb/1455-PB04-R.pdf. 35Google Scholar
- B. Calder, D. Grunwald, and J. Emer. 1996. Predictive sequential associative cache. In Proceedings of the 2nd International Symposium on High Performance Computer Architecture, pp. 244--253. 33, 34Google Scholar
- B. Calder, C. Krintz, S. John, and T. Austin. 1998. Cache-conscious data placement. In Proceedings of the International Conference on Architecture Support for Programming Languages and OperatingSystem (ASPLOS), pp. 139--149. 33Google Scholar
- F. Cantonnet, Y. Yao, M. Zahran, and T. El-Ghazawi. April 2004. Productivity analysis of the UPC language. In 3rd International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS). 68Google Scholar
- A. M. Caulfield, E. S. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, S. Heil, et al. 2016. A cloud-scale acceleration architecture. In 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-49, pp. 7:1--7:13. IEEE Press, Piscataway, NJ. http://dl.acm.org/citation.cfm?id=3195638.3195647. 23Google Scholar
Digital Library
- S. Chandrasekaran and G. Juckeland, eds. 2018. OpenACC for Programmers: Concepts and Strategies. Addison-Wesley, Boston, MA'. 83Google Scholar
- J. Chang and G. S. Sohi. 2006. Cooperative caching for chip multiprocessors. In Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA), pp. 264--276. 34Google Scholar
- L. Cheng, N. Muralimanohar, K. Ramani, R. Balasubramonian, and J. Carter. June 2006. Interconnect-aware coherence protocols for chip multiprocessors. In Proceedings of the 33rd IEEE/ACM International Symposium on Computer Architecture, pp. 339--351. 34Google Scholar
- B. Childers, J. W. Davidson, and M. L. Soffa. 2003. Continuous compilation: A new approach to aggressive and adaptive code transformation. In Proceedings of the 17th International Symposium on Parallel and Distributed Processing, IPDPS '03, pp. 205--214. 71Google Scholar
- Z. Chishti, M. D. Powell, and T. N. Vijaykumar. 2003. Distance associativity for high-performance energy-efficient non-uniform cache architectures. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 55--66. 6Google Scholar
- J. Clark, S. Leblanc, and S. Knight. 2009. Hardware Trojan horse device based on unintended USB channels. In Proceedings of the 2009 Third International Conference on Network and System Security, NSS '09, pp. 1--8. IEEE Computer Society, Washington, DC. 49Google Scholar
- J. Coburn, S. Ravi, A. Raghunathan, and S. Chakradhar. 2005. SECA: Security-enhanced communication architecture. In Proceedings of the 2005 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES '05, pp. 78--89. 47Google Scholar
- C. Cowan, C. Pu, D. Maier, H. Hinton, and J. Walpole. January 1998. StackGuard: Automatic adaptive detection and prevention of buffer-overflow attacks. In Proceedings of the 7th USENIX Security Symposium, pp. 63--78. 48Google Scholar
- W. Dally and B. Towles. 2001. Route packets, not wires: On-chip interconnection networks. In Proceedings of the 38th Conference on Design Automation, pp.684--689. 34Google Scholar
- R. H. Dennard, F. H. Gaensslen, V. L. Rideout, E. Bassous, and A. R. LeBlanc. October 1974. Design of ion-implanted MOSFET's with very small physical dimensions. IEEE Journal of Solid-State Circuits 9(5): 256--268. 2Google Scholar
Digital Library
- R. V. der Pas, E. Stotzer, and C. Terboven. 2017. Using OpenMP---the Next Step. MIT Press, Cambridge, MA. 77Google Scholar
- A. S. Dhodapkar and J. E. Smith. 2002. Managing multi-configuration hardware via dynamic working set analysis. In Proceedings of the 17th International Symposium on Computer Architecture, pp. 233--244. 33Google Scholar
- G. Di Crescenzo. 2005. Security of erasable memories against adaptive adversaries. In Proceedings of the edings of the2 005 ACM Workshop on Storage Security and Survivability, StorageSS '05, pp. 115--122. 47Google Scholar
Digital Library
- S. J. Eggers and R. H. Katz. 1989. Evaluating the performance of four snooping cache coherence protocols. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 2--15. 34Google Scholar
- M. Ekman, F. Dahlgren, and P. Stenstrom. August 2002. TLB and snoop energy-reduction using virtual caches for low-power chip-multiprocessor. In Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design, pp. 243--246. 34Google Scholar
- R. Elbaz, L. Torres, G. Sassatelli, P. Guillemin, C. Anguille, M. Bardouillet, C. Buatois, and J. B. Rigaud. 2005. Hardware engines for bus encryption: A survey of existing techniques. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE '05, Volume 3, pp. 40--45. 47Google Scholar
- H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, pp. 365--376. ACM, New York. 3Google Scholar
- K. I. Farkas, P. Chow, N. P. Jouppi, and Z. Vranesic. 1997. The multicluster architecture: Reducing cycle time through partitioning. In Proceedings of the 30th International Symposium on Microarchitecture, pp. 149--159. 34Google Scholar
- F. Fiori and F. Musolino. 2001. Analysis of EME produced by a microcontroller operation. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE '01, pp. 341--347. IEEE Press, Piscataway, NJ. 49Google Scholar
- A. Fiskiran and R. Lee. October 2004. Runtime execution monitoring (REM) to detect and prevent malicious code execution. In Proceedings of the IEEE International Conference on Computer Design, pp. 452--457. 49Google Scholar
- K. Flautner, N. Kim, S. Martin, D. Blaauw, and T. Mudge. May 2002. Drowsy caches: Simple techniques for reducing leakage power. In Proceedings of the Annual International Symposium on Computer Architecture, pp. 147--157. 34Google Scholar
- K. Gandolfi, C. Mourtel, and F. Olivier. 2001. Electromagnetic analysis: Concrete results. In Proceedings of the Third International Workshop on Cryptographic Hardware and Embedded Systems, CHES '01, pp. 251--261. Springer-Verlag, London. 49Google Scholar
- B. Gassend, G. E. Suh, D. Clarke, M. V. Dijk, and S. Devadas. 2003. Caches and hash trees for efficient memory integrity verification. In 9th International Symposium on High Performance Computer Architecture, pp. 295--306. 49Google Scholar
- O. Gelbart, P. Ott, B. Narahari, R. Simha, A. Choudhary, and J. Zambreno. May 2005. CODESSEAL: Compiler/FPGA approach to secure applications. In Proceedings of the IEEE International Conference on Intelligence and Security Informatics, pp. 530--535. 49Google Scholar
- K. Ghose and M. Kamble. August 1999. Reducing power in superscalar processor caches using subbanking, multiple line buffers and bit-line segmentation. In Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design, pp. 70--75. 34Google Scholar
- B. Grigorian, N. Farahpour, and G. Reinman. February 2015. Brainiac: Bringing reliable accuracy into neurally-implemented approximate computing. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium, pp. 615--626. 26Google Scholar
- F. Guo and Y. Solihin. June 2006. An analytical model for cache replacement policy performance. In SIGMETRICS '06/Performance '06: Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, pp. 228--239. 35Google Scholar
- Y. Guo, Q. Zhuge, J. Hu, J. Yi, M. Qiu, and E. H.-M. Sha. June 2013. Data placement and duplication for embedded multicore systems with scratch pad memory. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 32(6): 809--817. 35Google Scholar
Digital Library
- L. Hammond, B. Nayfeh, and K. Olukotun. 1997. A single-chip multiprocessor. IEEE Computer, pp. 79--85. 34Google Scholar
- T. D. Han and T. S. Abdelrahman. 2011. Reducing branch divergence in GPU programs. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4, pp. 3:1--3:8. ACM, New York. 22Google Scholar
- K. Hazelwood. 2011. Dynamic Binary Modification: Tools, Techniques, and Applications. Morgan & Claypool Publishers, San Rafael, CA. 71Google Scholar
- N. Hemsoth and T. P. Morgan. 2017. FPGA Frontiers: New Applications in Reconfigurable Computing. Next Platform Press, High Point, NC. 23Google Scholar
- J.-M. Hoc, ed. 1990. Psychology of Programming, 1. Elsevier, New York. 69Google Scholar
- R. Huang, D. Y. Deng, and G. E. Suh. March 2010. Orthrus efficient software integrity protection on multi-cores. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 371--384. 49Google Scholar
- G. F. Hughes and J. F. Murray. February 2005. Reliability and security of RAID storage systems and D2D archives using SATA disk drives. In IEEE Transactions on Storage, 1(1): 95--107. 47Google Scholar
Digital Library
- W. W. Hwu. 2015. Heterogeneous System Architecture: A New Compute Platform Infrastructure, 1. Morgan Kaufmann, Burlington, MA. 85Google Scholar
- K. Inoue, V. Moshnyaga, and K. Murakami. February 2002. Trends in high-performance, low-power cache memory architectures. IEICE Transactions on Electronics, E85-C(2): 303--314. 34Google Scholar
- T. B. Jablin, P. Prabhu, J. A. Jablin, N. P. Johnson, S. R. Beard, and D. I. August. 2011. Automatic CPU-GPU communication management and optimization. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '11, pp. 142--151. 19Google Scholar
- A. Jaleel, W. Hasenplaugh, M. Qureshi, J. Sebot, S. Steely, and J. Emer. 2008. Adaptive insertion policies for managing shared caches. In PACT '08: Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pp. 208--219. 51Google Scholar
- A. Jaleel, J. Nuzman, A. Moga, S. Steely, and J. Emer. February 2015. High performing cache hierarchies for server workloads: Relaxing inclusion to capture the latency benefits of exclusive caches. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium, pp. 343--353. 51Google Scholar
- J. Jeong and M. Dubois. February 2003. Cost-sensitive cache replacement algorithms. In Proceedings of the 9th IEEE Symposium on High Performance Computer Architecture, pp. 327--337. 35, 51Google Scholar
- N.E. Jerger, T. Krishna, and L.-S. Peh. 2017. On-Chip Networks. Morgan & Claypool Publishers, San Rafael, CA. 36Google Scholar
- Y. Jin, N. Kupp, and Y. Makris. 2009. Experiences in hardware Trojan design and implementation. In Proceedings of the 2009 IEEE International Workshop on Hardware-Oriented Security and Trust, HST '09, pp. 50--57. IEEE Computer Society, Washington, DC. 49Google Scholar
- N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA '17, pp. 1--12. ACM, New York. 27Google Scholar
Digital Library
- D. R. Kaeli, P. Mistry, D. Schaa, and D. P. Zhang. 2015. Heterogeneous Computing with OpenCL 2.0, 3. Morgan Kaufmann, Burlington, MA. 78Google Scholar
- M. Kamble and K. Ghose. August 1997. Analytical energy dissipation models for low power caches. In Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design, pp. 143--148. 34Google Scholar
- S. Kang, H. J. Choi, C. H. Kim, S. W. Chung, D. Kwon, and J. C. Na. 2011. Exploration of CPU/GPU co-execution: From the perspective of performance, energy, and temperature. In Proceedings of the 2011 ACM Symposium on Research in Applied Computation, RACS '11, pp. 38--43. 17Google Scholar
- T. Karkhanis and J. E. Smith. June 2002. A day in the life of a data cache miss. In Proceedings of the 2nd Annual Workshop on Memory Performance Issues (WMPI). 33Google Scholar
- R. Karri, J. Rajendran, K. Rosenfeld, and M. Tehranipoor. 2010. Trustworthy hardware: Identifying and classifying hardware Trojans. Computer, 43: 39--46. 47, 49Google Scholar
Digital Library
- R. Karri, K. Wu, P. Mishra, and Y. Kim. 2001. Concurrent error detection of fault-based side-channel cryptanalysis of 128-bit symmetric block ciphers. In Proceedings of the 38th annual Design Automation Conference, DAC '01, pp. 579--584. ACM, New York. 49Google Scholar
- S. Kaxiras, Z. Hu, and M. Martonosi. June 2001. Cache decay: Exploiting generational behavior to reduce cache leakage power. In Proceedings of the 28th IEEE/ACM International Symposium on Computer Architecture, pp. 240--251. 34Google Scholar
- G. S. Kc, A. D. Keromytis, and V. Prevelakis. 2003. Countering code-injection attacks with instruction-set randomization. In Proceedings of the 10th ACM Conference on Computer and Communications Security, CCS '03, pp. 272--280. ACM, New York. 48Google Scholar
- M. Kharbutli and Y. Solihin. October 2005. Counter-based cache replacement algorithms. In Proceedings of the International Conference on Computer Design, pp. 61--68. 51Google Scholar
- H. Kim, N. Vijaykrishnan, M. Kandemir, A. Sivasubramaniam, M. Irwin, and E. Geethanjali. August 2001. Power-aware partitioned cache architectures. In Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design, pp. 64--67. 34Google Scholar
- J. Kim, W. J. Dally, S. Scott, and D. Abts. 2008. Technology-driven, highly-scalable dragonfly topology. In Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA '08, pp. 77--88. IEEE Computer Society, Washington, DC. 41Google Scholar
- N. Kim, K. Flautner, D. Blaauw, and T. Mudge. November 2002. Drowsy instruction caches: Leakage power reduction using dynamic voltage scaling and cache sub-bank prediction. In Proceedings of the IEEE/ACM 35th International Symposium on Microarchitecture, pp. 219--230. 34Google Scholar
- N. Kim, K. Flautner, D. Blaauw, and T. Mudge. February 2004a. Circuit and microarchitectural techniques for reducing cache leakage power. IEEE Transactions on VLSI 12(2): 167--184. 34Google Scholar
Digital Library
- S. Kim, D. Chandra, and Y. Solihin. 2004b. Fair cache sharing and partitioning in a chip multiprocessor architecture. In PACT '04: Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques, pp. 111--122. 34Google Scholar
- J. Kin, M. Gupta, and W. H. Mangione-Smith. 1997. The filter cache: An energy efficient memory structure. In Proceedings of the 30th Annual International Symposium on Microarchitecture (MICRO-30), pp. 184--193. 33Google Scholar
- M. J. Kobrinsky, B. A. Block, J.-F. Zheng, B. C. Barnett, E. Mohammed, M. Reshotko, F. Robinson, S. List, I. Young, and K. Cadien. May 2004. On-chip optical interconnects. Intel Technology Journal, 8(2): 129--142. 39Google Scholar
- A. K. Kodi and A. Louri. March 2007. Power-aware bandwidth-reconfigurable optical interconnects for high-performance computing (HPC) systems. In IEEE Parallel and Distributed Processing Symposium. IPDPS 2007, pp. 1--10. 39Google Scholar
- J. Kong, O. Aciicmez, J.-P. Seifert, and H. Zhou. 2008. Deconstructingnewcache designs for thwarting software cache-based side channel attacks. In Proceedings of the 2nd ACM Workshop on Computer Security Architectures, CSAW '08, pp. 25--34. ACM, New York. 49Google Scholar
- P. Kongetira, K. Aingaran, and K. Olukotun. March 2005. Niagara: A 32-way multithreaded SPARC processor. IEEE Micro, 25(2): 21--29. 35Google Scholar
Digital Library
- V. Krishnan and J. Torrellas. 1999. A chip-multiprocessor architecture with speculative multithreading. IEEE Transactions on Computers, 48(9): 866--880. 34Google Scholar
Digital Library
- R. Kumar, V. Zyuban, and D. Tullsen. June 2005. Interconnection in multi-core architectures: Understanding mechanisms, overheads, and scaling. In International Symposium on Computer Architecture, pp. 408--419. 36Google Scholar
- G. Kurian, J. E. Miller, J. Psota, J. Eastep, J. Liu, J. Michel, L. C. Kimerling, and A. Agarwal. 2010. Atac: A 1000-core cache-coherent processor with on-chip optical network. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques, PACT '10, pp. 477--488. ACM, New York. 39Google Scholar
- H. Lee, G. Tyson, and M. Farrens. December 2000. Eager writeback---a technique for improving bandwidth utilization. In Proceedings of the IEEE/ACM 33nd International Symposium on Microarchitecture, pp. 11--21. 60Google Scholar
- J.-H. Lee and S.-D. Kim. 2002. Application-adaptive intelligent cache memory system. ACM Transactions on Embedded Computing Systems, 1(1): 56--78. 33Google Scholar
Digital Library
- R. B. Lee, D. K. Karig, J. P. McGregor, and Z. Shi. March 2003. Enlisting hardware architecture to thwart malicious code injection. In Proceedings of the International Conference on Security in Pervasive Computing, pp. 237--252. 48Google Scholar
- J. Lin. 2008. On malicious software classification. In Proceedings of the 2008 International Symposium on Intelligent Information Technology Application Workshops, pp. 368--371. IEEE Computer Society, Washington, DC. 47Google Scholar
Digital Library
- J. L. Lo, J. S. Emer, H. M. Levy, R. L. Stamm, and D. M. Tullsen. 1997. Convertingthread-level parallelism to instruction-level parallelism via simultaneous multithreading. ACM Transactions on Computer Systems, 15(3): 322--354. 3, 35Google Scholar
Digital Library
- G. H. Loh, N. Jayasena, M. Oskin, M. Nutter, D. Roberts, M. R. Meswani, D. P. Zhang, and M. Ignatowski. 2013. A processing-in-memory taxonomy and a case for studying fixed-function PIM. 1st Workshop on Near Data Processing, held in conjunction with the 46th IEEE/ACM International Symposium on Microarchitecture (MICRO 46). 91Google Scholar
- D. McGinn-Combs. February 2007. Security architecture and models. http://www.giac.org/resources. 47Google Scholar
- N. Megiddo and D. s. Modha, 2004. Outperforming LRU with an Adaptive Replacement Cache Algorithm. Computer 37(4): 58--65. 35Google Scholar
Digital Library
- D. S. Modha, R. Ananthanarayanan, S. K. Esser, A. Ndirango, A. J. Sherbondy, and R. Singh. August 2011. Cognitive computing. Communications of the ACM, 54(8): 62--71. 25Google Scholar
Digital Library
- G. E. Moore. April 1965. Cramming more components onto integrated circuits. Electronics, pp. 114--117. 2Google Scholar
- A. Moshovos. June 2005. Regionscout: Exploiting coarse grain sharing in snoop-based coherence. In Proceedings of the 32nd IEEE/ACM International Symposium on Computer Architecture, pp. 234--245. 34Google Scholar
Digital Library
- B. A. Nayfeh. 1998. The case for a single-chip multiprocessor. PhD thesis, Stanford University, Stanford, CA. 34Google Scholar
- Nergal. December 2001. Advanced return-into-lib(c) exploits (PaX case study). http://www.phrack.org/. 48Google Scholar
- M. Nijim, X. Qin, and T. Xie. November 2006. Modeling and improving security of a local disk system for write-intensive workloads. ACM Transactions on Storage, 2(4): 400--423. 47Google Scholar
Digital Library
- C. J. Nitta, M. K. Farrens, and V. Akella. 2013. On-Chip Photonic Interconnects: A Computer Architect's Perspective. Morgan & Claypool Publishers, San Rafael, CA. 39Google Scholar
- NVIDIA, 2017. NVIDIA Tesla v100 GPU architecture. http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf 19Google Scholar
- K. Patel and S. Parameswaran. June 2008. SHIELD: A software hardware design methodology for security and reliability of MPSoCs. In Proceedings of the ACM/IEEE Design Automation Conference, pp. 858--861. 50Google Scholar
- J.-K. Peir, W. Hsu, H. Young, and S. Ong. 1996. Improving cache performance with balanced tag and data paths. In Proceedings of the International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), pp. 268--278. 34Google Scholar
- J.-K. Peir, Y. Lee, and W. Hsu. 1998. Capturing dynamic memory reference behavior with adaptive cache toplogy. In Proceedings of the International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), pp. 240--250. 33Google Scholar
- G. Pekhimenko, T. Huberty, R. Cai, O. Mutlu, P. Gibbons, M. Kozuch, and T. Mowry. February 2015. Exploiting compressed block size as an indicator of future reuse. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium, pp. 51--63. 35Google Scholar
- M. Potkonjak, A. Nahapetian, M. Nelson, and T. Massey. 2009. Hardware Trojan horse detection using gate-level characterization. In Proceedings of the 46th Annual Design Automation Conference, DAC '09, pp. 688--693. ACM, NewYork. 49Google Scholar
- S. M. Potter. 2001. What can artificial intelligence get from neuroscience? In Artificial Intelligence Festschrift: The Next 50 Years, pp. 174--185. Springer-Verlag, New York. 26Google Scholar
- K. Punniyamurthy and A. Gerstlauer. 2017. Exploring non-uniform processing in-memory architectures. In 1st Workshop on Hardware/Software Techniques for Minimizing Data Movement, held in conjunction with PACT. 91Google Scholar
- M. Qureshi, A. Jaleel, Y. Patt, S. C. Steely, and J. Emer. June 2007. Adaptive insertion policies for high performance caching. In Proceedings of the 34th International Symposium on Computer Architecture (ISCA), pp. 381--391. 51, 60Google Scholar
- M. Qureshi, D. Lynch, O. Mutlu, and Y. Patt. June 2006. A case for MLP-aware cache replacement. In Proceedings of the 33rd International Symposium on Computer Architecture (ISCA), pp. 167--178. 35, 51Google Scholar
- M. K. Qureshi and Y. N. Patt. 2006. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 423--432. 34Google Scholar
- R. Ramanathan. 2006. Intel multi-core processors: Making the move to quad-core and beyond. White paper, Intel Corporation. 35Google Scholar
- J. Reineke, D. Grund, C. Berg, and R. Wilhelm. September 2006. Predictability of cache replacement policies. Reports of SFB/TR 14 AVACS 9, SFB/TR 14 AVACS. http://www.avacs.org 35Google Scholar
- A. Ros, M. Davari, and S. Kaxiras. February 2015. Hierarchical private/shared classification: The key to simple and efficient coherence for clustered cache hierarchies. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium, pp. 186--197. 35Google Scholar
- X. Ruan, A. Manzanares, S. Yin, M. Nijim, and X. Qin. 2009. Can we improve energy efficiency of secure disk systems without modifying security mechanisms? In Proceedings of the 2009 IEEE International Conference on Networking, Architecture, and Storage, NAS '09, pp. 413--420. 47Google Scholar
- K. Rupp. 2018. 42 years of microprocessor trend data. https://github.com/karlrupp/microprocessor-trend-data (last accessed March 2018). 2Google Scholar
- S. K. Sadasivam, B. W. Thompto, R. Kalla, and W. J. Starke. March 2017. IBM power9 processor architecture. IEEE Micro, 37(2): 40--51. 15Google Scholar
Digital Library
- S. Sayyaparaju, G. Chakma, S. Amer, and G. S. Rose. 2017. Circuit techniques for online learning of memristive synapses in CMOS-memristor neuromorphic systems. In Proceedings of the Great Lakes Symposium on VLSI 2017, GLSVLSI '17, pp. 479--482. ACM, New York. 26, 92Google Scholar
- M. Schuette and J. Shen. March 1987. Processor control flow monitoring using signatured instruction streams. IEEE Transactions on Computers, C-36(3): 264--276. 49Google Scholar
Digital Library
- C. D. Schuman, T. E. Potok, R. M. Patton, J. D. Birdwell, M. E. Dean, G. S. Rose, and J. S. Plank. May 2017. A survey of neuromorphic computing and neural networks in hardware. ArXiv e-prints. https://arxiv.org/abs/1705.06963 92Google Scholar
- V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons, and T. C. Mowry. 2016. Buddy-RAM: Improving the performance and efficiency of bulk bitwise operations using DRAM. https://arxiv.org/abs/1611.09988 91Google Scholar
- A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. 2016. Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proceedings of the 43rd International Symposium on Computer Architecture, ISCA '16, pp. 14--26. IEEE Press, Piscataway, NJ. 92Google Scholar
- R. Sheikh and M. Kharbutli. October 2010. Improving cache performance by combining cost-sensitivity and locality principles in cache replacement algorithms. In Proceedings of the International Conference on Computer Design (ICCD), pp. 76--83. 51Google Scholar
- P. Siegl, R. Buchty, and M. Berekovic. 2016. Data-centric computing frontiers: A survey on processing-in-memory. In Proceedings of the Second International Symposium on Memory Systems, MEMSYS '16, pp. 295--308. ACM, New York. 91Google Scholar
- B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner. 2005. POWER5 system microarchitecture. IBM Journal of Research and Development, 49(4/5): 505--521. 35
- A. Smith. 1982. Cache memories. ACM Computing Surveys, 14(3): 473--530. 33
- F.-X. Standaert, T. G. Malkin, and M. Yung. 2009. A unified framework for the analysis of side-channel key recovery attacks. In Proceedings of the 28th Annual International Conference on Advances in Cryptology: The Theory and Applications of Cryptographic Techniques, EUROCRYPT '09, pp. 443--461. Springer-Verlag, Berlin, Heidelberg. 47, 49
- L. Su, S. Courcambeck, P. Guillemin, C. Schwarz, and R. Pacalet. 2009. SecBus: Operating system controlled hierarchical page-based memory bus protection. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE '09, pp. 570--573. 47
- H. Sutter. March 2005. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb's Journal, 30(3): 202--210. 4
- TCG. April 2008. Trusted platform module (TPM) summary. http://www.trustedcomputinggroup.org/. 47
- M. Tehranipoor and F. Koushanfar. January 2010. A survey of hardware Trojan taxonomy and detection. IEEE Design & Test of Computers, 27(1): 10--25. 49
- A. Tereshkin. 2010. Evil maid goes after PGP whole disk encryption. In Proceedings of the 3rd International Conference on Security of Information and Networks, SIN '10, p. 2. ACM, New York. 49
- K. Tiri. 2007. Side-channel attack pitfalls. In Proceedings of the 44th Annual Design Automation Conference, DAC '07, pp. 15--20. ACM, New York. 47, 49
- M. Tomasevic and V. Milutinovic. 1993. The Cache Coherence Problem in Shared-Memory Multiprocessors: Hardware Solutions. IEEE Computer Society Press, Los Alamitos, CA. 34
- D. M. Tullsen, S. Eggers, and H. M. Levy. 1995. Simultaneous multithreading: Maximizing on-chip parallelism. In Proceedings of the 22nd International Symposium on Computer Architecture, pp. 392--403. 3, 14, 34, 35
- D. Unat, A. Dubey, T. Hoefler, J. Shalf, M. Abraham, M. Bianco, B. L. Chamberlain, et al. October 2017. Trends in data locality abstractions for HPC systems. IEEE Transactions on Parallel and Distributed Systems, 28(10): 3007--3020. 64
- US Department of Energy. April 2013. Technical challenges of exascale computing. Technical Report JSR-12-310. https://fas.org/irp/agency/dod/jason/exascale.pdf.
- US Department of Energy. 2016. Neuromorphic computing, architectures, models, and applications: A beyond-CMOS approach to future computing. Technical report, Oak Ridge National Laboratory. 94
- A. S. Vaidya, A. Shayesteh, D. H. Woo, R. Saharoy, and M. Azimi. 2013. SIMD divergence optimization through intra-warp compaction. In Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA '13, pp. 368--379. ACM, New York. 22
- K. Varadarajan, S. Nandy, V. Sharda, A. Bharadwaj, R. Iyer, S. Makineni, and D. Newell. June 2006. Molecular caches: A caching structure for dynamic creation of application-specific heterogeneous cache regions. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39), pp. 433--442. 34
- R. Vaslin, G. Gogniat, J.-P. Diguet, E. Wanderley, R. Tessier, and W. Burleson. February 2009. A security approach for off-chip memory in embedded microprocessor systems. Microprocessors and Microsystems, 33(1): 37--45. 47
- A. V. Veidenbaum, W. Tang, R. Gupta, A. Nicolau, and X. Ji. 1999. Adapting cache line size to application behavior. In Proceedings of the 1999 International Conference on Supercomputing, pp. 145--154. 33, 34
- A. Waksman and S. Sethumadhavan. 2010. Tamper evident microprocessors. In Proceedings of the 2010 IEEE Symposium on Security and Privacy, SP '10, pp. 173--188. IEEE Computer Society, Washington, DC. 47, 49
- K. Wang, K. Angstadt, C. Bo, N. Brunelle, E. Sadredini, T. Tracy II, J. Wadden, M. Stan, and K. Skadron. 2016. An overview of Micron's automata processor. In Proceedings of the Eleventh IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, CODES '16, pp. 14:1--14:3. ACM, New York. 24
- P. Wang, D. Feng, W. Wu, and L. Zhang. 2009. On the correctness of an approach against side-channel attacks. In Proceedings of the 5th International Conference on Information Security Practice and Experience, ISPEC '09, pp. 336--344. Springer-Verlag, Berlin, Heidelberg. 47, 49
- X. Wang, H. Salmani, M. Tehranipoor, and J. Plusquellic. 2008. Hardware Trojan detection and isolation using current integration and localized current analysis. In Proceedings of the 2008 IEEE International Symposium on Defect and Fault Tolerance of VLSI Systems, pp. 87--95. IEEE Computer Society, Washington, DC. 49
- E. Wheeler. September 2008. Replay attacks. http://www.sans.org/. 48
- W. Wong and J.-L. Baer. January 2000. Modified LRU policies for improving second level cache behavior. In Sixth International Symposium on High-Performance Computer Architecture (HPCA-6), pp. 49--60. 35
- S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. 1995. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, ISCA '95, pp. 24--36. ACM, New York. 51
- X. Xie, Y. Liang, Y. Wang, G. Sun, and T. Wang. February 2015. Coordinated static and dynamic cache bypassing for GPUs. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, pp. 76--88. 35
- J. Xue, A. Garg, B. Ciftcioglu, J. Hu, S. Wang, I. Savidis, M. Jain, et al. 2010. An intra-chip free-space optical interconnect. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pp. 94--105. ACM, New York. 39
- C. Yan, D. Englender, M. Prvulovic, B. Rogers, and Y. Solihin. 2006. Improving cost, performance, and security of memory encryption and authentication. In Proceedings of the 33rd Annual International Symposium on Computer Architecture, ISCA '06, pp. 179--190. 47
- H. Yang, R. Govindarajan, G. R. Gao, and Z. Hu. December 2005. Improving power efficiency with compiler-assisted cache replacement. Journal of Embedded Computing, 1(4): 487--499. 51
- T. T. Ye. 2003. Physical planning for on-chip multiprocessor networks and switch fabrics. In 14th IEEE International Conference on Application-Specific Systems, Architectures and Processors (ASAP'03), pp. 97--107. 34
- M. Ying. 2016. Foundations of Quantum Programming, 1st ed. Morgan Kaufmann, Burlington, MA. 95
- M. Zahran. March 2016. Brain-inspired machines: What, exactly, are we looking for? IEEE Pulse, 7(2): 48--51. 26
- W. Zhang, M. Kandemir, M. Karakoy, and G. Chen. August 2005. Reducing data cache leakage energy using a compiler-based approach. ACM Transactions on Embedded Computing Systems, 4(3): 652--678.
Cited By
- Fernandes L, Kharate P and Singh B (2024). The Future of High Performance Computing in Biomimetics and Some Challenges, High Performance Computing in Biomimetics, 10.1007/978-981-97-1017-1_15, (287-303)
- Carratalá-Sáez R, Torres Y, Sierra-Pallares J, López-Huguet S and Llanos D (2023). UVaFTLE: Lagrangian finite time Lyapunov exponent extraction for fluid dynamic applications, The Journal of Supercomputing, 10.1007/s11227-022-05017-x, 79:9, (9635-9665), Online publication date: 1-Jun-2023.
- Nikolic G, Dimitrijevic B, Nikolic T and Stojcev M (2022). A Survey of Three Types of Processing Units: CPU, GPU and TPU, 2022 57th International Scientific Conference on Information, Communication and Energy Systems and Technologies (ICEST), 10.1109/ICEST55168.2022.9828625, 978-1-6654-8500-5, (1-6)
- Cérin C, Kimura K and Sow M (2022). Data stream clustering for low-cost machines, Journal of Parallel and Distributed Computing, 10.1016/j.jpdc.2022.04.009, Online publication date: 1-Apr-2022.
- Nikolic G, Dimitrijevic B, Nikolic T and Stojcev M (2022). Fifty years of microprocessor evolution: from single CPU to multicore and manycore systems, Facta universitatis - series: Electronics and Energetics, 10.2298/FUEE2202155N, 35:2, (155-186)
- Zahran M (2021). The Future of High-Performance Computing, 2021 17th International Computer Engineering Conference (ICENCO), 10.1109/ICENCO49852.2021.9698918, 978-1-7281-6448-9, (129-134)
- Long D, Morkos B and Ferguson S (2021). Toward Quantifiable Evidence of Excess’ Value Using Personal Gaming Desktops, Journal of Mechanical Design, 10.1115/1.4049520, 143:3, Online publication date: 1-Mar-2021.
- Armstrong M (2020). High Performance Computing for Geospatial Applications: A Prospective View, High Performance Computing for Geospatial Applications, 10.1007/978-3-030-47998-5_15, (271-284)