DOI: 10.1145/3307650.3322226

Cambricon-F: machine learning computers with fractal von Neumann architecture

Published: 22 June 2019

Abstract

Machine learning techniques are pervasive tools in emerging commercial applications, and dedicated machine learning computers at many scales have been deployed in embedded devices, servers, and data centers. Most current machine learning computer architectures still focus on optimizing performance and energy efficiency rather than programming productivity. However, as silicon technology develops rapidly, programming productivity, which includes both application programming and software stack development, is replacing performance and power efficiency as the key obstacle to the adoption of machine learning computers.
In this paper, we propose Cambricon-F, a series of homogeneous, sequential, multi-layer, layer-similar machine learning computers that share the same ISA. A Cambricon-F machine has a fractal von Neumann architecture that manages its components iteratively: the machine itself has a von Neumann architecture, and each of its processing components (sub-nodes) is again a Cambricon-F machine with a von Neumann architecture and the same ISA. Since Cambricon-F instances of different scales share the same software stack through their common ISA, Cambricon-F can significantly improve programming productivity. Moreover, we address four major challenges in the Cambricon-F architecture design, which allow Cambricon-F to achieve high efficiency. We implement two Cambricon-F instances at different scales, Cambricon-F100 and Cambricon-F1. Compared to GPU-based machines (DGX-1 and 1080Ti), the Cambricon-F instances achieve 2.82x and 5.14x better performance and 8.37x and 11.39x better efficiency on average, with 74.5% and 93.8% smaller area costs, respectively.
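
To make the recursive structure concrete, the following minimal Python sketch toy-models the fractal execution idea from the abstract: every node exposes the same operation interface, and an inner node serves a request by decomposing it onto sub-nodes that are themselves the same kind of node, down to leaves that compute directly. This is an illustration under assumed names (FractalNode, run, fanout), not the paper's actual ISA, hardware, or software stack.

    # A toy model of the fractal idea: a node either executes an operation
    # directly (leaf) or issues the *same* operation to identical sub-nodes.
    # All names here are hypothetical, not from the paper.
    from typing import Callable, List

    class FractalNode:
        """A machine whose processing components are machines of the
        same kind, understanding the same instruction set."""

        def __init__(self, depth: int, fanout: int):
            # depth > 0: an inner node built from `fanout` identical,
            # smaller FractalNodes; depth == 0: a leaf that computes.
            self.subnodes: List["FractalNode"] = (
                [FractalNode(depth - 1, fanout) for _ in range(fanout)]
                if depth > 0 else []
            )

        def run(self, op: Callable[[list], float], data: list) -> float:
            if not self.subnodes:
                return op(data)  # leaf: execute the instruction directly
            # Inner node: split the input, issue the same instruction to
            # each sub-node, then reduce the partial results with the same
            # operation (valid for associative reductions such as sum).
            chunk = max(1, len(data) // len(self.subnodes))
            parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
            partial = [self.subnodes[i % len(self.subnodes)].run(op, p)
                       for i, p in enumerate(parts)]
            return op(partial)

    # The same "program" (a vector sum) runs unchanged on a two-level,
    # four-way machine and on a single leaf node: one code base, any scale.
    def vector_sum(xs: list) -> float:
        return sum(xs)

    big = FractalNode(depth=2, fanout=4)
    tiny = FractalNode(depth=0, fanout=4)
    assert big.run(vector_sum, list(range(100))) == tiny.run(vector_sum, list(range(100)))

The point of the sketch is the scale-invariance claimed in the abstract: because every layer speaks the same instruction set, the program written for one node needs no change to run on a machine of any depth.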



            Published In

            ISCA '19: Proceedings of the 46th International Symposium on Computer Architecture
            June 2019
            849 pages
            ISBN: 9781450366694
            DOI: 10.1145/3307650
            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]


            In-Cooperation

            • IEEE-CS\DATC: IEEE Computer Society

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Qualifiers

            • Research-article

            Funding Sources

            • Beijing Natural Science Foundation
            • National Basic Research Program of China (973 Program)
            • National Science and Technology Major Project
            • National Key Research and Development Program of China
            • Key Research Projects in Frontier Science of Chinese Academy of Sciences
            • Transformation and Transfer of Scientific and Technological Achievements of Chinese Academy of Sciences
            • Strategic Priority Research Program of Chinese Academy of Science
            • National Natural Science Foundation of China

            Conference

            ISCA '19

            Acceptance Rates

            ISCA '19 paper acceptance rate: 62 of 365 submissions (17%)
            Overall acceptance rate: 543 of 3,203 submissions (17%)



            Cited By
            • (2024) "Architectures for Machine Learning", Handbook of Computer Architecture, pp. 321-379. DOI: 10.1007/978-981-97-9314-3_12. Online publication date: 21-Dec-2024.
            • (2023) "Rescue to the Curse of universality", Science China Information Sciences, 66:9. DOI: 10.1007/s11432-021-3596-x. Online publication date: 2-Aug-2023.
            • (2022) "Fractal Parallel Computing", Intelligent Computing, vol. 2022. DOI: 10.34133/2022/9797623. Online publication date: 5-Sep-2022.
            • (2022) "ANT: Exploiting Adaptive Numerical Data Type for Low-Bit Deep Neural Network Quantization", Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 1414-1433. DOI: 10.1109/MICRO56248.2022.00095. Online publication date: 1-Oct-2022.
            • (2022) "Cambricon-P: A Bitflow Architecture for Arbitrary Precision Computing", Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 57-72. DOI: 10.1109/MICRO56248.2022.00016. Online publication date: 1-Oct-2022.
            • (2022) "Architectures for Machine Learning", Handbook of Computer Architecture, pp. 1-59. DOI: 10.1007/978-981-15-6401-7_12-1. Online publication date: 11-Aug-2022.
            • (2021) "Polyhedral-Based Compilation Framework for In-Memory Neural Network Accelerators", ACM Journal on Emerging Technologies in Computing Systems, 18:1, pp. 1-23. DOI: 10.1145/3469847. Online publication date: 29-Sep-2021.
            • (2021) "AKG: automatic kernel generation for neural processing units using polyhedral transformations", Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, pp. 1233-1248. DOI: 10.1145/3453483.3454106. Online publication date: 19-Jun-2021.
            • (2021) "HASCO", Proceedings of the 48th Annual International Symposium on Computer Architecture, pp. 1055-1068. DOI: 10.1109/ISCA52012.2021.00086. Online publication date: 14-Jun-2021.
            • (2020) "DNNGuard: An Elastic Heterogeneous DNN Accelerator Architecture against Adversarial Attacks", Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 19-34. DOI: 10.1145/3373376.3378532. Online publication date: 9-Mar-2020.
