DOI: 10.1145/3037697.3037749
Research article

Moonwalk: NRE Optimization in ASIC Clouds

Published: 04 April 2017

ABSTRACT

Cloud services are becoming increasingly globalized, and datacenter workloads are expanding exponentially. GPU- and FPGA-based clouds have demonstrated improvements in power and performance by accelerating compute-intensive workloads. ASIC-based clouds are a promising way to optimize the Total Cost of Ownership (TCO) of a given datacenter computation (e.g., YouTube transcoding) by reducing both energy consumption and marginal computation cost.

The feasibility of an ASIC Cloud for a particular application is directly gated by the ability to manage the Non-Recurring Engineering (NRE) costs of designing and fabricating the ASIC, keeping them significantly lower (e.g., by 2X) than the TCO of the best available alternative.
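As a rough illustration of this feasibility gate, the check below compares an ASIC's NRE against the TCO of the best available alternative, using the 2X margin from the example above as the default threshold. The function name, its parameters, and the dollar figures are hypothetical; this is a sketch of the stated condition, not the paper's cost model.

```python
def nre_is_manageable(nre_usd: float, alt_tco_usd: float, margin: float = 2.0) -> bool:
    """Return True if the ASIC's NRE is at least `margin` times lower than the
    TCO of the best available alternative (e.g., a GPU or FPGA cloud).

    Hypothetical helper; margin=2.0 mirrors the "e.g., 2X" example in the text.
    """
    if nre_usd < 0 or alt_tco_usd < 0:
        raise ValueError("costs must be non-negative")
    if margin <= 0:
        raise ValueError("margin must be positive")
    # The NRE, scaled up by the required margin, must still fit under the alternative's TCO.
    return nre_usd * margin <= alt_tco_usd


# Placeholder figures against a 10M USD alternative TCO:
print(nre_is_manageable(4e6, 10e6))  # True: 4M NRE clears the 2X gate
print(nre_is_manageable(6e6, 10e6))  # False: 6M NRE does not
```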

In this paper, we show that technology node selection is a major tool for managing ASIC Cloud NRE: it allows the designer to trade off an accelerator's excess energy efficiency and cost-performance for lower total cost.
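A toy model makes this trade-off concrete. Assume, purely for illustration, that the total cost of a node is its NRE plus the TCO of the ASIC Cloud built in that node, and that the advanced node cuts TCO (through better energy efficiency and cost-performance) at the price of a higher NRE. The node names 65nm and 16nm come from the text, but `Node`, `total_cost`, `best_node`, and every number below are hypothetical placeholders, not data from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A candidate technology node in a toy NRE + TCO model (hypothetical)."""
    name: str
    nre_usd: float       # placeholder one-time cost: design, masks, bring-up
    tco_fraction: float  # placeholder ASIC Cloud TCO as a fraction of the baseline TCO


# Invented values: the advanced node costs more up front but runs cheaper.
NODES = [
    Node("65nm", nre_usd=3.0e6, tco_fraction=0.05),
    Node("16nm", nre_usd=15.0e6, tco_fraction=0.01),
]


def total_cost(node: Node, baseline_tco_usd: float) -> float:
    """NRE plus the ASIC Cloud TCO for a workload whose baseline TCO is given."""
    return node.nre_usd + node.tco_fraction * baseline_tco_usd


def best_node(baseline_tco_usd: float) -> Node:
    """Return the node with the lowest NRE + TCO for this workload size."""
    return min(NODES, key=lambda n: total_cost(n, baseline_tco_usd))


# A modest workload favors the cheaper-to-design node; a huge one favors 16nm.
print(best_node(20e6).name)  # 65nm
print(best_node(2e9).name)   # 16nm
```

The linear TCO term is a simplification chosen for readability; a fuller model would derive each node's TCO from its energy efficiency and cost per op/s.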

We explore NRE and cross-technology optimization of ASIC Clouds for four different applications: Bitcoin mining, YouTube-style video transcoding, Litecoin mining, and deep learning. For each, we address the NRE challenge and show large reductions in NRE, potentially enabling ASIC Clouds to address a wider variety of datacenter workloads. Our results suggest that advanced nodes like 16nm will lead to sub-optimal TCO for many workloads, and that older nodes like 65nm can enable a greater diversity of ASIC Clouds.
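To see why an advanced node can lose, assume (for illustration only; this is not the paper's full cost model) that the total cost of building in node i is its NRE plus a TCO that scales linearly with the workload's baseline TCO T, with hypothetical per-node factor f_i. For an advanced node b (e.g., 16nm) with higher NRE and lower f than an older node a (e.g., 65nm), the breakeven point follows directly; the last line plugs in placeholder values.

```latex
\begin{align*}
C_i(T) &= \mathrm{NRE}_i + f_i\,T \\
C_b(T) < C_a(T) &\iff (f_a - f_b)\,T > \mathrm{NRE}_b - \mathrm{NRE}_a
  \iff T > T^{*} = \frac{\mathrm{NRE}_b - \mathrm{NRE}_a}{f_a - f_b}
  \quad (\mathrm{NRE}_b > \mathrm{NRE}_a,\ f_a > f_b) \\
\text{e.g. } T^{*} &= \frac{15\,\mathrm{M} - 3\,\mathrm{M}}{0.05 - 0.01} = 300\,\mathrm{M}\ \text{USD (placeholder values)}
\end{align*}
```

Under these assumptions, any workload whose baseline TCO falls below T* is cheaper to serve in the older node.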

Published in

ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems
April 2017
856 pages
ISBN: 9781450344654
DOI: 10.1145/3037697

Copyright © 2017 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

ASPLOS '17 paper acceptance rate: 53 of 320 submissions, 17%. Overall acceptance rate: 535 of 2,713 submissions, 20%.
