Abstract
The movement of large quantities of data during the training of a deep neural network presents immense challenges for machine learning workloads, especially those based on future functional memories deployed to store network models. As the size of network models begins to vastly outstrip traditional silicon computing resources, functional memories based on flash, resistive switches, magnetic tunnel junctions, and other technologies can store these ultra-large models. However, new approaches are then needed to minimize hardware overhead, especially for the movement and calculation of gradient information that cannot be efficiently contained in these new memory resources. To address this, we introduce streaming batch principal component analysis (SBPCA) as an update algorithm. SBPCA uses stochastic power iterations to generate a rank-k approximation of the network gradient. We demonstrate that the low-rank updates produced by SBPCA can effectively train convolutional neural networks on a variety of common datasets, with performance comparable to standard mini-batch gradient descent. Our approximation is expressed in an expanded vector form that can be applied efficiently to the rows and columns of crossbars for array-level updates. These results promise improvements in the design of application-specific integrated circuits built around large vector-matrix multiplier memories.
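To make the update scheme concrete, here is a minimal sketch, assuming a fully connected layer whose mini-batch gradient is a sum of per-example outer products G = sum_i delta_i x_i^T; the function and variable names (sbpca_update, deltas, xs, apply_low_rank_update) are illustrative, not the authors' implementation. The key point of power iteration in this setting is that all intermediates stay (B x k)- or (n x k)-sized, so the full m x n gradient matrix is never materialized.

```python
import numpy as np

def sbpca_update(deltas, xs, V, n_iters=3):
    """Refine a rank-k approximation of the batch gradient G = deltas.T @ xs
    by power iteration, without ever materializing G.

    deltas : (B, m) per-example output-side error vectors
    xs     : (B, n) per-example input activations
    V      : (n, k) orthonormal subspace estimate carried over between batches
    Returns U (m, k) and the refined V (n, k), so that G is approximated by U @ V.T.
    """
    for _ in range(n_iters):
        GV = deltas.T @ (xs @ V)      # G @ V assembled from rank-1 pieces: (m, k)
        GtGV = xs.T @ (deltas @ GV)   # G.T @ (G @ V): (n, k)
        V, _ = np.linalg.qr(GtGV)     # re-orthonormalize: one power-iteration step
    U = deltas.T @ (xs @ V)           # expanded column vectors of the update
    return U, V

def apply_low_rank_update(W, U, V, lr):
    """Apply the rank-k update as k rank-1 outer products, the array-level
    row/column programming pattern described in the abstract."""
    for j in range(U.shape[1]):
        W -= lr * np.outer(U[:, j], V[:, j])
    return W
```

Because V carries over from batch to batch, each mini-batch needs only a few power iterations to track the gradient's dominant subspace; this warm start is the streaming aspect. Each column pair (U[:, j], V[:, j]) is one of the expanded vectors that can drive a crossbar's rows and columns in parallel.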