DOI: 10.1145/3123939.3123979
Research article | Public Access

Scale-out acceleration for machine learning

Published: 14 October 2017

Abstract

The growing scale and complexity of Machine Learning (ML) algorithms have resulted in the prevalent use of distributed general-purpose systems. In a rather disjoint effort, the community has focused mostly on high-performance single-node accelerators for learning. This work bridges these two paradigms and offers CoSMIC, a full computing stack comprising a language, compiler, system software, template architecture, and circuit generators, that enables programmable acceleration of learning at scale. CoSMIC enables programmers to exploit scale-out acceleration using FPGAs and Programmable ASICs (P-ASICs) from a high-level, mathematical Domain-Specific Language (DSL). At the same time, CoSMIC does not require programmers to delve into the onerous tasks of system software development or hardware design. CoSMIC achieves the three conflicting objectives of efficiency, automation, and programmability by integrating a novel multi-threaded template accelerator architecture with a cohesive stack that generates the hardware and software code from its high-level DSL. CoSMIC can accelerate a wide range of learning algorithms that are most commonly trained using parallel variants of gradient descent. The key is to distribute the partial gradient calculations of the learning algorithms across the accelerator-augmented nodes of the scale-out system. Additionally, CoSMIC leverages the parallelizability of the algorithms to offer multi-threaded acceleration within each node. Multi-threading allows CoSMIC to efficiently exploit the numerous resources becoming available on modern FPGAs/P-ASICs by striking a balance between multi-threaded parallelism and single-threaded performance. CoSMIC takes advantage of algorithmic properties of ML to offer specialized system software that optimizes task allocation, role assignment, thread management, and internode communication. We evaluate the versatility and efficiency of CoSMIC on 10 machine learning applications from various domains. On average, a 16-node CoSMIC system with UltraScale+ FPGAs offers an 18.8× speedup over a 16-node Spark system with Xeon processors, while the programmer writes only 22--55 lines of code. CoSMIC also scales better than Spark: scaling from 4 to 16 nodes yields a 2.7× improvement with CoSMIC versus 1.8× with Spark. These results confirm that the full-stack approach of CoSMIC takes an effective and vital step toward enabling scale-out acceleration for machine learning.
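
The distribution strategy described above (each node computes a partial gradient over its shard of the data, and the aggregated gradient drives a single update of the shared model) can be illustrated with a minimal sketch. The Python code below is not CoSMIC's DSL or its generated code; it assumes a toy linear-regression objective, and names such as partial_gradient, distributed_step, and shards are hypothetical, chosen only to convey the synchronous data-parallel gradient-descent pattern.

import numpy as np

# Minimal sketch of synchronous data-parallel gradient descent (illustrative
# only; this is not CoSMIC code). Each "node" holds a shard of the training
# data and computes a partial gradient; the coordinator averages the partial
# gradients and applies one update to the shared model.

def partial_gradient(w, X_shard, y_shard):
    # Gradient of the mean squared error of a linear model over one shard.
    residual = X_shard @ w - y_shard
    return X_shard.T @ residual / len(y_shard)

def distributed_step(w, shards, lr):
    # One synchronous update: gather one partial gradient per node, average,
    # and take a gradient-descent step on the shared parameters.
    grads = [partial_gradient(w, X, y) for X, y in shards]
    return w - lr * np.mean(grads, axis=0)

# Toy data: four "nodes", each with 256 examples of an 8-feature regression.
rng = np.random.default_rng(0)
w_true = rng.normal(size=8)
shards = []
for _ in range(4):
    X = rng.normal(size=(256, 8))
    shards.append((X, X @ w_true + 0.01 * rng.normal(size=256)))

w = np.zeros(8)
for _ in range(200):
    w = distributed_step(w, shards, lr=0.1)
print("parameter error:", float(np.linalg.norm(w - w_true)))

In the system the abstract describes, the per-shard gradient computation would run on the FPGA or P-ASIC attached to each node and would itself be multi-threaded within the node; the sketch only conveys the aggregation structure that makes the distribution possible.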




Published In

MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture
October 2017, 850 pages
ISBN: 9781450349529
DOI: 10.1145/3123939

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. accelerator
  2. cloud
  3. distributed
  4. machine learning
  5. scale-out



Acceptance Rates

Overall acceptance rate: 484 of 2,242 submissions (22%)
