
LPSD: Low-Rank Plus Sparse Decomposition for Highly Compressed CNN Models

Conference paper

Advances in Knowledge Discovery and Data Mining (PAKDD 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14645)


Abstract

Low-rank decomposition, which explores and eliminates the linear dependency within a tensor, is often used as a structured model pruning method for deep convolutional neural networks. However, the model accuracy declines rapidly once the compression ratio exceeds a threshold. We have observed that with a small number of sparse elements, the accuracy of highly compressed CNN models can be recovered significantly. Based on this premise, we developed a novel method, called LPSD (Low-rank Plus Sparse Decomposition), that decomposes a CNN weight tensor into a combination of a low-rank component and a sparse component, which better maintains accuracy at high compression ratios. For a pretrained model, the network structure of each layer is split into two branches: one for the low-rank part and one for the sparse part. LPSD adapts an alternating approximation algorithm to minimize the global error and the local error alternately. An exhaustive search method with pruning is designed to find the optimal group number, ranks, and sparsity. Experimental results demonstrate that in most scenarios, LPSD achieves better accuracy than state-of-the-art methods when the model is highly compressed.
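
To make the high-level idea concrete, the following NumPy sketch splits a weight matrix into a truncated-SVD low-rank part plus a sparse residual that keeps only the largest-magnitude entries. It illustrates the general low-rank plus sparse decomposition, not the authors' LPSD algorithm (which additionally uses grouping, alternating global/local approximation, and a search over group number, ranks, and sparsity); the function name and the `rank`/`nnz` parameters are illustrative assumptions.

```python
import numpy as np

def low_rank_plus_sparse(W, rank, nnz):
    """Split W into a rank-`rank` matrix L plus a sparse matrix S with
    `nnz` nonzeros, so that W is approximated by L + S (illustration only)."""
    # Best rank-`rank` approximation via truncated SVD (Eckart-Young-Mirsky).
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

    # Keep the `nnz` largest-magnitude entries of the residual as the sparse part
    # (the optimal choice in the Frobenius norm; see Lemma 1 in the appendix).
    R = W - L
    S = np.zeros_like(R)
    keep = np.unravel_index(np.argsort(np.abs(R), axis=None)[-nnz:], R.shape)
    S[keep] = R[keep]
    return L, S

W = np.random.randn(64, 3 * 3 * 64)        # e.g. a reshaped 3x3 conv kernel
L, S = low_rank_plus_sparse(W, rank=8, nnz=500)
print(np.linalg.norm(W - (L + S)) / np.linalg.norm(W))   # relative residual
```

In a CNN, `W` would typically be a convolution kernel reshaped into a matrix before the split, and the two components would be mapped back onto a low-rank branch and a sparse branch of the layer.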


References

1. Cai, J.F., Li, J., Xia, D.: Generalized low-rank plus sparse tensor estimation by fast Riemannian optimization (2022)

2. Chu, B.S., Lee, C.R.: Low-rank tensor decomposition for compression of convolutional neural networks using funnel regularization (2021)

3. Guo, K., Xie, X., Xu, X., Xing, X.: Compressing by learning in a low-rank and sparse decomposition form. IEEE Access 7, 150823–150832 (2019). https://doi.org/10.1109/ACCESS.2019.2947846

4. Han, S., et al.: DSD: Dense-sparse-dense training for deep neural networks (2017)

5. Hawkins, C., Yang, H., Li, M., Lai, L., Chandra, V.: Low-rank+sparse tensor compression for neural networks (2021)

6. Huang, W., et al.: Deep low-rank plus sparse network for dynamic MR imaging (2021)

7. Idelbayev, Y., Carreira-Perpinan, M.A.: Low-rank compression of neural nets: learning the rank of each layer. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8046–8056 (2020). https://doi.org/10.1109/CVPR42600.2020.00807

8. Kaloshin, P.: Convolutional neural networks compression with low rank and sparse tensor decompositions (2020)

9. Kim, Y.D., Park, E., Yoo, S., Choi, T., Yang, L., Shin, D.: Compression of deep convolutional neural networks for fast and low power mobile applications (2016)

10. Liang, C.C., Lee, C.R.: Automatic selection of tensor decomposition for compressing convolutional neural networks: a case study on VGG-type networks. In: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 770–778 (2021). https://doi.org/10.1109/IPDPSW52791.2021.00115

11. Liebenwein, L., Maalouf, A., Gal, O., Feldman, D., Rus, D.: Compressing neural networks: towards determining the optimal layer-wise decomposition. CoRR abs/2107.11442 (2021). https://arxiv.org/abs/2107.11442

12. Lin, T., Stich, S.U., Barba, L., Dmitriev, D., Jaggi, M.: Dynamic model pruning with feedback (2020)

13. Otazo, R., Candès, E., Sodickson, D.: Low-rank plus sparse matrix decomposition for accelerated dynamic MRI with separation of background and dynamic components. Magn. Reson. Med. 73, 1125–1136 (2014). https://doi.org/10.1002/mrm.25240

14. Yin, M., Phan, H., Zang, X., Liao, S., Yuan, B.: BATUDE: budget-aware neural network compression based on Tucker decomposition. Proc. AAAI Conf. Artif. Intell. 36, 8874–8882 (2022). https://doi.org/10.1609/aaai.v36i8.20869

15. Yu, X., Liu, T., Wang, X., Tao, D.: On compressing deep models by low rank and sparse decomposition. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 67–76 (2017). https://doi.org/10.1109/CVPR.2017.15

16. Zhang, X., Wang, L., Gu, Q.: A unified framework for low-rank plus sparse matrix recovery (2018)


Author information


Corresponding author

Correspondence to Che-Rung Lee.


Appendix

1.1 A. Optimality of Sparsity Selection

The optimality of the sparsity selection method can be proven by the Eckart-Young-Mirsky theorem [11]. Let A be an \(m\times n\) matrix, and let \(\textrm{nnz}(A)\) be the number of nonzero elements of A. The norm used is the Frobenius norm, defined as \(\Vert A\Vert _F = \sqrt{\sum _{i=1}^m \sum _{j=1}^n A_{i,j}^2}, \) where \(A_{i,j}\) is the (i, j)th element of A. The following lemma shows how to find the optimal sparse matrix S that minimizes \(\Vert A-S\Vert _F\).

Lemma 1

Let A be an \(m\times n\) matrix. The solution that minimizes \(\Vert A-S\Vert _F\) subject to \(\textrm{nnz}(S)=s\) is the matrix T whose entries equal \(A_{i,j}\) at the indices of the s largest \(|A_{i,j}|\) and are zero elsewhere.

The proof is straightforward, since \(\Vert A-S\Vert _F^2 = \sum _{i=1}^m\sum _{j=1}^n (A_{i,j}-S_{i,j})^2.\) Its minimal value is obtained by cancelling the terms corresponding to the s largest \(|A_{i,j}|\), which is equivalent to making \((A_{i,j}-S_{i,j})=0\) for those elements largest in magnitude.
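
As a sanity check of Lemma 1, the short NumPy sketch below (an illustrative assumption, not code from the paper) compares the residual of keeping the s largest-magnitude entries against the residual of an arbitrary random support of the same size; the former is never larger.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 32))
s = 50                                    # allowed number of nonzeros in S

# Lemma 1: keep the s largest-magnitude entries of A, zero out the rest.
order = np.argsort(np.abs(A), axis=None)
S_opt = np.zeros_like(A)
top = np.unravel_index(order[-s:], A.shape)
S_opt[top] = A[top]

# Any other support of size s (here a random one) yields a larger residual.
S_rand = np.zeros_like(A)
rand = np.unravel_index(rng.choice(A.size, size=s, replace=False), A.shape)
S_rand[rand] = A[rand]

print(np.linalg.norm(A - S_opt), "<=", np.linalg.norm(A - S_rand))
```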

1.2 B. Error Estimation

Theorem 1

If a collection of n data points follows a normal distribution with mean 0, then the sum of the top-k squares can be estimated by the formula:

$$ n \frac{-2\sigma ^ 2}{\sigma \sqrt{2 \pi }} \Biggr [ -t e^{\frac{-t^2}{2 \sigma ^ 2}} - \sigma \sqrt{2 \pi } + \sigma \sqrt{2 \pi } F_X(t) \Biggr ], $$

where \(t = F_X^{-1}\bigl (1 - \frac{k}{2n}\bigr )\) and \(F_{X}(t)\) is the cumulative distribution function corresponding to \(f_{X}(t) = \frac{1}{\sigma \sqrt{2\pi }} e^{\frac{-t^2}{2\sigma ^2}}\), the probability density function of the normal distribution with mean 0 and variance \(\sigma ^2\).

Proof

Let \(X \sim N(\mu =0, \sigma ^2)\) be the random variable of the data, whose probability density function is \(f_{X}(t) = \frac{1}{\sigma \sqrt{2\pi }} e^{\frac{-t^2}{2\sigma ^2}}\). Now consider another random variable \(Y=X^2\). The probability density function of Y can be derived as:

$$ f_{Y}(y) = \frac{d}{dy}Pr(Y\le y) = \frac{d}{dy}Pr(-\sqrt{y} \le X \le \sqrt{y}) = \frac{d}{dy} \int _{-\sqrt{y}}^{\sqrt{y}} f_{X}(x) \ dx $$

We can rewrite \(f_Y(y)\) as:

$$\begin{aligned} &f_{Y}(y) = \frac{d}{dy} \int _{-\sqrt{y}}^{\sqrt{y}} f_{X}(x) \ dx\ = \frac{d}{dy} F_X(x)\Biggr |_{-\sqrt{y}}^{\sqrt{y}} \\ &= \frac{d}{dy} \biggl (F_X(\sqrt{y}) - F_X(-\sqrt{y})\biggl ) = f_X(\sqrt{y})\frac{1}{2\sqrt{y}} + f_X(-\sqrt{y})\frac{1}{2\sqrt{y}} = \frac{1}{\sqrt{y}} f_X(\sqrt{y}) \end{aligned}$$

After obtaining the probability density function of \(Y=X^2\), the kth largest squared value in the data can be found. Assume that this value is \(t^2 \ (t>0)\); then

$$\begin{aligned} &Pr(Y\le t^2) = 1 - \frac{k}{n} \\ &\Rightarrow Pr(-t \le X \le t) = 1 - \frac{k}{n} \\ &\Rightarrow Pr(X > t) = \frac{k}{2n} \Rightarrow Pr(X \le t) = 1 - \frac{k}{2n} \\ &\Rightarrow t = F_X^{-1}\biggl (1 - \frac{k}{2n}\biggl ) \end{aligned}$$

After obtaining the kth largest squared value, \(t^2\), the average of the top-k squares can be found from the conditional expected value:

$$ E[Y | Y\ge t^2] = \frac{\int _{t^2}^{\infty } yf_Y(y) \ dy}{\int _{t^2}^{\infty } f_Y(y) \ dy} = \frac{1}{\frac{k}{n}} \int _{t^2}^{\infty } \frac{y}{\sqrt{y}} f_X(\sqrt{y}) \ dy = \frac{1}{\frac{k}{n}} \int _{t^2}^{\infty } \sqrt{y} \frac{1}{\sigma \sqrt{2 \pi }} e^{\frac{-y}{2\sigma ^2}} \ dy $$

Focus on the integral part:

$$\begin{aligned} &\int _{t^2}^{\infty } \sqrt{y} \frac{1}{\sigma \sqrt{2 \pi }} e^{\frac{-y}{2\sigma ^2}} \ dy = \int _{t^2}^{\infty } \sqrt{y} \frac{-2\sigma ^ 2}{\sigma \sqrt{2 \pi }} e^{\frac{-y}{2\sigma ^2}} \ d\biggl (\frac{-y}{2\sigma ^2}\biggl ) = \frac{-2\sigma ^ 2}{\sigma \sqrt{2 \pi }} \int _{t^2}^{\infty } \sqrt{y} e^{\frac{-y}{2\sigma ^2}} \ d\biggl (\frac{-y}{2\sigma ^2}\biggl ) \\ &= \frac{-2\sigma ^ 2}{\sigma \sqrt{2 \pi }} \Biggr [ \sqrt{y} e^{\frac{-y}{2\sigma ^2}} \Biggr |^{\infty }_{t^2} -\int _{t^2}^{\infty } e^{\frac{-y}{2\sigma ^2}} \ d(\sqrt{y}) \Biggr ] = \frac{-2\sigma ^ 2}{\sigma \sqrt{2 \pi }} \Biggr [ -t e^{\frac{-t^2}{2 \sigma ^ 2}} - \sigma \sqrt{2 \pi } F_X(\sqrt{y})\Biggr |^{\infty }_{t^2} \Biggr ] \\ &= \frac{-2\sigma ^ 2}{\sigma \sqrt{2 \pi }} \Biggr [ -t e^{\frac{-t^2}{2 \sigma ^ 2}} - \sigma \sqrt{2 \pi } + \sigma \sqrt{2 \pi } F_X(t) \Biggr ] \\ \end{aligned}$$

The sum of the top-k squares can then be estimated as:

$$\begin{aligned} &k E[Y \ | \ Y\ge t^2] = k \frac{1}{\frac{k}{n}} \frac{-2\sigma ^ 2}{\sigma \sqrt{2 \pi }} \Biggr [ -t e^{\frac{-t^2}{2 \sigma ^ 2}} - \sigma \sqrt{2 \pi } + \sigma \sqrt{2 \pi } F_X(t) \Biggr ] \\ &= n \frac{-2\sigma ^ 2}{\sigma \sqrt{2 \pi }} \Biggr [ -t e^{\frac{-t^2}{2 \sigma ^ 2}} - \sigma \sqrt{2 \pi } + \sigma \sqrt{2 \pi } F_X(t) \Biggr ] \end{aligned}$$

Corollary 1

If the values of an \(a\times b\) matrix W follow a normal distribution with mean 0, the sum of the top-k squares can be estimated by Theorem 1:

$$ n \frac{-2\sigma ^ 2}{\sigma \sqrt{2 \pi }} \Biggr [ -t e^{\frac{-t^2}{2 \sigma ^ 2}} - \sigma \sqrt{2 \pi } + \sigma \sqrt{2 \pi } F_X(t) \Biggr ] $$

\(\sigma ^2\) can be estimated by the squared Frobenius norm divided by the number of matrix elements:

$$\begin{aligned} \sigma ^2 = E[X^2] = \frac{\Vert W\Vert _F^2}{n}, \end{aligned}$$

where \(n=ab\).
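
Theorem 1 and Corollary 1 can be checked numerically. The sketch below is an illustrative implementation (the helper name `topk_squares_estimate` and the use of SciPy's normal CDF/quantile are assumptions, not part of the paper): it estimates \(\sigma\) from the Frobenius norm of a random matrix and compares the estimated top-k squares sum with the exact value.

```python
import numpy as np
from scipy.stats import norm

def topk_squares_estimate(n, k, sigma):
    """Theorem 1: estimated sum of the k largest squared values among
    n zero-mean normal samples with standard deviation sigma."""
    t = norm.ppf(1 - k / (2 * n), scale=sigma)           # t = F_X^{-1}(1 - k/(2n))
    bracket = (-t * np.exp(-t**2 / (2 * sigma**2))
               - sigma * np.sqrt(2 * np.pi) * (1 - norm.cdf(t, scale=sigma)))
    return n * (-2 * sigma**2) / (sigma * np.sqrt(2 * np.pi)) * bracket

# Corollary 1: apply the estimate to an a x b matrix W with zero-mean entries,
# estimating sigma^2 as ||W||_F^2 / (a*b).
rng = np.random.default_rng(0)
a, b, k = 256, 256, 500
W = 0.05 * rng.standard_normal((a, b))    # synthetic "weight" matrix
n = a * b
sigma = np.sqrt(np.sum(W**2) / n)

empirical = np.sort((W**2).ravel())[-k:].sum()
print(empirical, topk_squares_estimate(n, k, sigma))
```

Since the matrix entries here are drawn from a zero-mean normal distribution, the estimate closely tracks the empirical sum; for real weight tensors the normality assumption holds only approximately.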


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Huang, KH., Sie, CY., Lin, JE., Lee, CR. (2024). LPSD: Low-Rank Plus Sparse Decomposition for Highly Compressed CNN Models. In: Yang, DN., Xie, X., Tseng, V.S., Pei, J., Huang, JW., Lin, J.CW. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science, vol 14645. Springer, Singapore. https://doi.org/10.1007/978-981-97-2242-6_28

Download citation

  • DOI: https://doi.org/10.1007/978-981-97-2242-6_28


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2241-9

  • Online ISBN: 978-981-97-2242-6

  • eBook Packages: Computer Science, Computer Science (R0)
