ABSTRACT
Cloud services are becoming increasingly globalized, and datacenter workloads are expanding exponentially. GPU- and FPGA-based clouds have demonstrated improvements in power and performance by accelerating compute-intensive workloads. ASIC-based clouds are a promising way to optimize the Total Cost of Ownership (TCO) of a given datacenter computation (e.g., YouTube transcoding) by reducing both energy consumption and marginal computation cost.
The feasibility of an ASIC Cloud for a particular application is directly gated by the ability to manage the Non-Recurring Engineering (NRE) cost of designing and fabricating the ASIC, so that the NRE is significantly lower (e.g., 2X) than the TCO of the best available alternative.
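The feasibility condition above can be written as a one-line check. This is a minimal sketch, not the paper's model: the function name `nre_is_manageable`, the `margin` parameter, and the dollar figures in the usage lines are assumptions made for illustration, with the default margin taken from the "e.g., 2X" in the text.

```python
def nre_is_manageable(nre_cost: float, alternative_tco: float, margin: float = 2.0) -> bool:
    """Return True when the NRE is at least `margin` times lower than the
    TCO of the best available alternative (illustrative criterion only)."""
    if margin <= 0:
        raise ValueError("margin must be positive")
    return nre_cost * margin <= alternative_tco


# Hypothetical figures, in dollars; none of these come from the paper.
print(nre_is_manageable(nre_cost=5e6, alternative_tco=12e6))  # NRE is 2.4X lower
print(nre_is_manageable(nre_cost=8e6, alternative_tco=12e6))  # NRE is only 1.5X lower
```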
In this paper, we show that technology node selection is a major tool for managing ASIC Cloud NRE, and that it allows the designer to trade off an accelerator's excess energy efficiency and cost-performance for lower total cost.
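The trade-off can be illustrated with a toy cost model. This is a rough sketch under stated assumptions, not the paper's methodology: total cost is taken to be NRE plus a per-unit TCO multiplied by the required throughput, and every number in `NODES` is invented for illustration. The names `NodeOption`, `total_cost`, and `cheapest_node` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeOption:
    name: str
    nre_cost: float      # one-time design and mask cost, dollars (made-up value)
    tco_per_unit: float  # datacenter TCO per unit of throughput, dollars (made-up value)


def total_cost(node: NodeOption, demand: float) -> float:
    """Assumed linear model: NRE plus per-unit TCO times required throughput."""
    return node.nre_cost + node.tco_per_unit * demand


def cheapest_node(nodes: list[NodeOption], demand: float) -> NodeOption:
    """Pick the node with the lowest total cost for a given demand."""
    return min(nodes, key=lambda n: total_cost(n, demand))


# Hypothetical options: newer nodes have higher NRE but lower per-unit TCO.
NODES = [
    NodeOption("65nm", nre_cost=1.0e6, tco_per_unit=10.0),
    NodeOption("28nm", nre_cost=3.0e6, tco_per_unit=4.0),
    NodeOption("16nm", nre_cost=8.0e6, tco_per_unit=2.0),
]

for demand in (1e5, 1e6, 1e7):
    best = cheapest_node(NODES, demand)
    print(f"demand={demand:.0e}: {best.name}, total=${total_cost(best, demand):,.0f}")
```

With these invented numbers, the oldest node wins at small scale because its NRE dominates the bill, and the most advanced node only pays off once the workload is large enough to amortize its NRE.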
We explore NRE and cross-technology optimization of ASIC Clouds for four different applications: Bitcoin mining, YouTube-style video transcoding, Litecoin, and Deep Learning. We show large reductions in NRE, potentially enabling ASIC Clouds to address a wider variety of datacenter workloads. Our results suggest that advanced nodes like 16nm will lead to sub-optimal TCO for many workloads, and that use of older nodes like 65nm can enable a greater diversity of ASIC Clouds.
Moonwalk: NRE Optimization in ASIC Clouds. ASPLOS '17.