How Much Training Data Is Memorized in Overparameterized Autoencoders? An Inverse Problem Perspective on Memorization Evaluation

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14942)

Abstract

Overparameterized autoencoder models often memorize their training data. For image data, memorization is often examined by using the trained autoencoder to recover missing regions in its training images (which were used only in their complete form during training). In this paper, we propose an inverse problem perspective for the study of memorization. Given a degraded training image, we define the recovery of the original training image as an inverse problem and formulate it as an optimization task. In our inverse problem, we use the trained autoencoder to implicitly define a regularizer for the particular training dataset that we aim to retrieve from. We develop the intricate optimization task into a practical method that iteratively applies the trained autoencoder and relatively simple computations that estimate and address the unknown degradation operator. We evaluate our method for blind inpainting, where the goal is to recover training images from a degradation of many missing pixels in an unknown pattern. We examine various deep autoencoder architectures, such as fully connected and U-Net (with various nonlinearities and at diverse train loss values), and show that our method significantly outperforms previous memorization-evaluation methods that recover training data from autoencoders. Importantly, our method also greatly improves the recovery performance in settings that were previously considered highly challenging, and even impractical, for such recovery and memorization evaluation.

Notes

  1. Appendices E-H are also available at https://arxiv.org/pdf/2310.02897.

  2. In our case where the original image pixel values are in [0, 1]: \(PSNR=10\log _{10}\left( {\frac{1}{MSE}}\right) \).
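A minimal NumPy sketch of this PSNR computation (assuming the images are stored as arrays with pixel values in [0, 1]; the function name is illustrative):

```python
import numpy as np

def psnr(original, recovered):
    """PSNR for images with pixel values in [0, 1]: 10 * log10(1 / MSE)."""
    mse = np.mean((original - recovered) ** 2)
    return 10.0 * np.log10(1.0 / mse)
```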

References

  1. Afonso, M.V., Bioucas-Dias, J.M., Figueiredo, M.A.T.: Fast image recovery using variable splitting and constrained optimization. IEEE Trans. Image Process. 19(9), 2345–2356 (2010)

  2. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)

  3. Brifman, A., Romano, Y., Elad, M.: Turning a denoiser into a super-resolver using plug and play priors. In: 2016 IEEE International Conference on Image Processing (ICIP) (2016)

  4. Carlini, N., Chien, S., Nasr, M., Song, S., Terzis, A., Tramèr, F.: Membership inference attacks from first principles. In: 2022 IEEE Symposium on Security and Privacy (SP), pp. 1897–1914 (2022)

  5. Chan, S.H., Wang, X., Elgendy, O.A.: Plug-and-play ADMM for image restoration: fixed-point convergence and applications. IEEE Trans. Comput. Imag. 3(1), 84–98 (2017)

  6. Dar, Y., Bruckstein, A.M., Elad, M., Giryes, R.: Postprocessing of compressed images via sequential denoising. IEEE Trans. Image Process. 25(7), 3044–3058 (2016)

  7. Dar, Y., Mayer, P., Luzi, L., Baraniuk, R.G.: Subspace fitting meets regression: the effects of supervision and orthonormality constraints on double descent of generalization errors. In: International Conference on Machine Learning (ICML), pp. 2366–2375 (2020)

  8. Hertrich, J., Neumayer, S., Steidl, G.: Convolutional proximal neural networks and plug-and-play algorithms. Linear Algebra Appl. 631, 203–234 (2021)

  9. Hu, H., Salcic, Z., Sun, L., Dobbie, G., Yu, P.S., Zhang, X.: Membership inference attacks on machine learning: a survey. ACM Comput. Surv. 54(11s), 1–37 (2022)

  10. Jiang, Y., Pehlevan, C.: Associative memory in iterated overparameterized sigmoid autoencoders. In: Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 119, pp. 4828–4838. PMLR (13–18 Jul 2020)

  11. Kamilov, U.S., Mansour, H., Wohlberg, B.: A plug-and-play priors approach for solving nonlinear imaging inverse problems. IEEE Signal Process. Lett. 24(12), 1872–1876 (2017)

  12. Moreau, J.J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France 93, 273–299 (1965)

  13. Nouri, A., Seyyedsalehi, S.A.: Eigen value based loss function for training attractors in iterated autoencoders. Neural Netw. 161, 575–588 (2023)

  14. Radhakrishnan, A., Belkin, M., Uhler, C.: Overparameterized neural networks implement associative memory. Proc. Natl. Acad. Sci. 117(44), 27162–27170 (2020)

  15. Radhakrishnan, A., Uhler, C., Belkin, M.: Downsampling leads to image memorization in convolutional autoencoders (2018)

  16. Radhakrishnan, A., Yang, K., Belkin, M., Uhler, C.: Memorization in overparameterized autoencoders. arXiv preprint arXiv:1810.10333 (2018)

  17. Rond, A., Giryes, R., Elad, M.: Poisson inverse problems by the plug-and-play scheme. J. Vis. Commun. Image Represent. 41, 96–108 (2016)

  18. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-24574-4_28

  19. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60(1–4), 259–268 (1992)

  20. Shokri, R., Stronati, M., Song, C., Shmatikov, V.: Membership inference attacks against machine learning models. In: 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. IEEE (2017)

  21. Sreehari, S., et al.: Plug-and-play priors for bright field electron tomography and sparse interpolation. IEEE Trans. Comput. Imaging 2(4), 408–423 (2016)

  22. Venkatakrishnan, S.V., Bouman, C.A., Wohlberg, B.: Plug-and-play priors for model based reconstruction. In: IEEE GlobalSIP (2013)

  23. Wang, Y., Yu, J., Zhang, J.: Zero-shot image restoration using denoising diffusion null-space model. In: International Conference on Learning Representations (ICLR) (2023)

Acknowledgements

This work was supported by the Lynn and William Frankel Center for Computer Science at Ben-Gurion University, and by the Israeli Council for Higher Education (CHE) via the Data Science Research Center, Ben-Gurion University of the Negev.

Author information

Corresponding author

Correspondence to Koren Abitbul.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 6795 KB)

Appendices

A Proof of Theorem 1

In this Appendix, we prove Theorem 1.

Lemma A.1

Given a 2-layer tied autoencoder, f, which can be formulated as

$$\begin{aligned} f(\textbf{x}) = \textbf{W}^{T} \rho (\textbf{W} \textbf{x}) \end{aligned}$$

for \(\textbf{x}\in \mathbb {R}^d\), where \(\textbf{W} \in \mathbb {R}^{m \times d}\) and \(\rho : \mathbb {R}^{m}\rightarrow \mathbb {R}^{m}\) is an activation function with the (separable) componentwise form

$$\begin{aligned} \rho (\textbf{z})=[\bar{\rho }(z_1),\dots ,\bar{\rho }(z_m)]^T ~~~\text {for}~~ \textbf{z}\in \mathbb {R}^m \end{aligned}$$
(21)

where \(z_j\) denotes the \(j^\textrm{th}\) component of \(\textbf{z}\triangleq \textbf{W} \textbf{x}\), and \(\bar{\rho }:\mathbb {R}\rightarrow \mathbb {R}\) is a scalar activation function.

Then, the Jacobian matrix of f is

$$\begin{aligned} \frac{df(\textbf{x})}{d\textbf{x}} = \textbf{W}^{T} \mathrm{{diag}}\left( \frac{d\bar{\rho }(z_1)}{d z_1}, \frac{d\bar{\rho }(z_2)}{d z_2}, \ldots , \frac{d\bar{\rho }(z_m)}{d z_m }\right) \textbf{W}. \end{aligned}$$
(22)

where \(\text {diag}(\cdot )\) represents a diagonal matrix with the given components along the main diagonal.

Proof

Let us define auxiliary variables. In addition to \(\textbf{z}\triangleq \textbf{W} \textbf{x}\), we also define \(\textbf{a}\triangleq \rho (\textbf{z})\) and \(\boldsymbol{\xi }\triangleq f(\textbf{x}) = \textbf{W}^{T} \textbf{a}\).

Then, by the chain rule, we get

$$\begin{aligned} \frac{df(\textbf{x})}{d\textbf{x}} = \frac{d\boldsymbol{\xi }}{d\textbf{a}} \cdot \frac{d\textbf{a}}{d\textbf{z}} \cdot \frac{d\textbf{z}}{d\textbf{x}} = \textbf{W}^{T} \cdot \frac{d\textbf{a}}{d\textbf{z}} \cdot \textbf{W} \end{aligned}$$
(23)

Next, by definition, \(\frac{d\textbf{a}}{d\textbf{z}} = \frac{d\rho (\textbf{z})}{d\textbf{z}}\), and since \(\rho (\textbf{z})\) is a vector of componentwise activation functions (see (21)), this Jacobian is an \(m\times m\) diagonal matrix of the form

$$\begin{aligned} \frac{d\rho (\textbf{z})}{d\textbf{z}} = \mathrm{{diag}}\left( \frac{d\bar{\rho }(z_1)}{d z_1}, \frac{d\bar{\rho }(z_2)}{d z_2}, \ldots , \frac{d\bar{\rho }(z_m)}{d z_m }\right) \end{aligned}$$
(24)

Substituting (24) back into (23) gives the Jacobian formula of (22).
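As a quick numerical sanity check of the Jacobian formula (22), the following Python sketch (with illustrative dimensions and a softplus activation, whose derivative lies in (0, 1); these choices are assumptions for illustration only) compares the closed-form Jacobian to a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 6, 10
W = rng.standard_normal((m, d)) * 0.1
x = rng.standard_normal(d)

softplus = lambda z: np.log1p(np.exp(z))             # scalar activation rho_bar
softplus_grad = lambda z: 1.0 / (1.0 + np.exp(-z))   # its derivative, in (0, 1)

f = lambda x: W.T @ softplus(W @ x)                  # 2-layer tied autoencoder

# Closed-form Jacobian from (22): W^T diag(rho_bar'(z)) W, with z = W x.
J_closed = W.T @ np.diag(softplus_grad(W @ x)) @ W

# Central finite-difference approximation of the Jacobian.
eps = 1e-6
J_fd = np.column_stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                        for e in np.eye(d)])

assert np.allclose(J_closed, J_fd, atol=1e-5)
```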

Corollary A.1

Let \(\bar{\rho }:\mathbb {R}\rightarrow \mathbb {R}\) be a scalar activation function that is differentiable and has derivatives in [0, 1], namely, \(\frac{d \bar{\rho }(z)}{dz} \in [0,1]\) for any \(z\in \mathbb {R}\). Then, for such an activation function, a 2-layer tied autoencoder has a Jacobian of the form \(\textbf{W}^{T} \textbf{D} \textbf{W}\), where \(\textbf{D}\) is a diagonal matrix whose values are in [0, 1].

Lemma A.2

Let \(\textbf{W}\in \mathbb {R}^{c_2\times c_1}\) and let \(\textbf{D}\) be a \(c_2\times c_2\) diagonal matrix with values in [0, 1]. Then, \(\textbf{W}^T \textbf{D} \textbf{W}\) is a symmetric positive semi-definite matrix.

Proof

First, we show that the matrix is symmetric:

$$\begin{aligned} (\textbf{W}^T \textbf{D} \textbf{W})^T = \textbf{W}^T \textbf{D}^T \textbf{W} = \textbf{W}^T \textbf{D} \textbf{W}, \end{aligned}$$

where \(\textbf{D}^T=\textbf{D}\) since \(\textbf{D}\) is diagonal and hence symmetric.

Now, we prove that the matrix \(\textbf{W}^T \textbf{D} \textbf{W}\) is positive semi-definite, namely, that \(\textbf{r}^T\textbf{W}^T\textbf{D}\textbf{W}\textbf{r} \ge 0\) for any \(\textbf{r} \in \mathbb {R}^{c_1}\). Defining \(\widetilde{\textbf{r}} \triangleq \textbf{W} \textbf{r}\), it suffices to show that \(\widetilde{\textbf{r}}^T\textbf{D}\widetilde{\textbf{r}} \ge 0\). Denoting by \(\widetilde{r}_i\) the \(i^\textrm{th}\) component of \(\widetilde{\textbf{r}}\) and by \(\textbf{D}_{i,i}\) the \(i^\textrm{th}\) main-diagonal component of \(\textbf{D}\), we get \(\widetilde{\textbf{r}}^T\textbf{D}\widetilde{\textbf{r}}=\sum _{i=1}^{c_2} \textbf{D}_{i,i}\widetilde{r}_i^2 \ge 0\) because \(\textbf{D}_{i,i}\in [0,1]\) for every i.

Now we proceed to prove Theorem 1, i.e., that a tied autoencoder f from the class described in the theorem is a Moreau proximity operator.

Proof

We prove that \(f(\textbf{x})\) is a Moreau proximity operator by showing that the Jacobian matrix of \(f(\textbf{x})\) w.r.t. any \(\textbf{x} \in \mathbb {R}^d\) satisfies two properties: (i) the Jacobian is a symmetric matrix, and (ii) all the eigenvalues of the Jacobian matrix are real and in the range \([0,1]\). Note that previous works on plug-and-play priors used conditions (i)-(ii) to prove that special types of denoisers are Moreau proximity operators; see, e.g., [21].

Consider a 2-layer tied autoencoder, f, which can be formulated as

$$\begin{aligned} f(\textbf{x}) = \textbf{W}^{T} \rho (\textbf{W} \textbf{x}) \end{aligned}$$

where \(\textbf{W} \in \mathbb {R}^{m \times d}\) has all its singular values in [0, 1], and \(\rho : \mathbb {R}^{m}\rightarrow \mathbb {R}^{m}\) is a componentwise activation function as in (21) that is based on a differentiable scalar activation function \(\bar{\rho }:\mathbb {R}\rightarrow \mathbb {R}\) whose derivative is in [0, 1].

We will now prove that f is a Moreau proximity operator.

From Corollary A.1 and Lemma A.2, we get that f has a Jacobian matrix \(\textbf{W}^{T} \textbf{D} \textbf{W}\), which is symmetric and positive semi-definite.

We will now prove that the eigenvalues of \(\textbf{W}^{T} \textbf{D} \textbf{W}\) are all in \([0,1]\). First, since \(\textbf{D}\) is diagonal with entries in [0, 1], its singular values equal its diagonal entries and are therefore in [0, 1]. In addition, the singular values of \(\textbf{W}^{T}\) are the same as the singular values of \(\textbf{W}\), which are in [0, 1] by the assumption of Theorem 1. Hence, the singular values of each of the matrices in the product \(\textbf{W}^{T} \textbf{D} \textbf{W}\) are in [0, 1]. It is also well known that for every two matrices \(\textbf{A}\in \mathbb {R}^{q_1\times q_2}\), \(\textbf{B}\in \mathbb {R}^{q_2\times q_3}\),

$$\begin{aligned} \sigma _i(\textbf{AB}) \le \sigma _1(\textbf{A}) \sigma _i(\textbf{B}) \end{aligned}$$
(25)

where \(\sigma _i\) denotes the \(i^\textrm{th}\) largest singular value of a corresponding matrix. Hence, for \(\textbf{C}\in \mathbb {R}^{q_3\times q_4}\),

$$\begin{aligned} \sigma _i(\textbf{ABC}) \le \sigma _1(\textbf{A}) \sigma _1(\textbf{B}) \sigma _i(\textbf{C}). \end{aligned}$$

In our case, \(\sigma _1(\textbf{W}^{T})\le 1\), \(\sigma _1(\textbf{D})\le 1\), \(\sigma _i(\textbf{W})\le 1\), and therefore

$$\begin{aligned} \sigma _i(\textbf{W}^{T} \textbf{D} \textbf{W}) \le \sigma _1(\textbf{W}^{T}) \sigma _1(\textbf{D}) \sigma _i(\textbf{W}) \le 1. \end{aligned}$$
(26)

Consequently, all the singular values of the Jacobian matrix are in [0, 1]. Moreover, for real symmetric matrices, the absolute values of the eigenvalues are equal to the singular values. Since the Jacobian of our 2-layer tied autoencoder is real and symmetric, by (26) we get that the eigenvalues of this Jacobian are in \([-1,1]\). Moreover, by Lemma A.2, the Jacobian is symmetric positive semi-definite and therefore its eigenvalues are non-negative; accordingly, all the eigenvalues of the Jacobian are in \([0,1]\).

To sum up, we showed that a 2-layer tied autoencoder that satisfies the conditions in Theorem 1 has a symmetric positive semi-definite Jacobian with eigenvalues in [0, 1]; therefore, such a 2-layer autoencoder is a Moreau proximity operator.
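The conclusion of Theorem 1 can also be checked numerically. The Python sketch below uses a random \(\textbf{W}\) rescaled so that its largest singular value is at most 1 and a diagonal \(\textbf{D}\) with entries drawn from [0, 1] (both illustrative choices, not values from the paper), and verifies that the eigenvalues of \(\textbf{W}^{T} \textbf{D} \textbf{W}\) are real and lie in [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 20

# Random W rescaled so that its largest singular value is at most 1.
W = rng.standard_normal((m, d))
W /= np.linalg.norm(W, 2)          # spectral norm = largest singular value

# Diagonal D with entries in [0, 1], standing in for the activation derivatives.
D = np.diag(rng.uniform(0.0, 1.0, size=m))

J = W.T @ D @ W                    # Jacobian form from Corollary A.1
eigvals = np.linalg.eigvalsh(J)    # real eigenvalues of the symmetric matrix

assert np.all(eigvals >= -1e-12) and np.all(eigvals <= 1 + 1e-12)
```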

B Proof of Equation (17)

Recall the notations in (8). The optimization problem (17), i.e.,

$$\begin{aligned} \widehat{\boldsymbol{\xi }}^{(k)} = \mathop {\mathrm {arg\,min}}\limits _{\textbf{x}\in \mathbb {R}^d} \left\| {{ \mathbf {\Theta } \textbf{x} - \textbf{y}}}\right\| _2^2 + \frac{\gamma }{2} \left\| {{ \textbf{x} - \widetilde{\textbf{v}}^{(k)}}}\right\| _2^2 \end{aligned}$$
(27)

has a closed form solution

$$\begin{aligned} \widehat{\boldsymbol{\xi }}^{(k)} = \left( {\mathbf {\Theta }^T \mathbf {\Theta } + \frac{\gamma }{2}\textbf{I}}\right) ^{-1}\left( {\mathbf {\Theta }\textbf{y} + \frac{\gamma }{2}\widetilde{\textbf{v}}^{(k)}}\right) \end{aligned}$$
(28)

Then, the diagonal structure of \(\mathbf {\Theta }\) with zeros and ones on its main diagonal implies that \(\mathbf {\Theta }^T \mathbf {\Theta } = \mathbf {\Theta }\) and, therefore,

$$\begin{aligned} \widehat{\boldsymbol{\xi }}^{(k)} = \left( {\mathbf {\Theta } + \frac{\gamma }{2}\textbf{I}}\right) ^{-1}\left( {\mathbf {\Theta }\textbf{y} + \frac{\gamma }{2}\widetilde{\textbf{v}}^{(k)}}\right) \end{aligned}$$
(29)

that can be further simplified to the componentwise form of

$$ \widehat{\xi _i}^{(k)} = {\left\{ \begin{array}{ll} \widetilde{v}^{(k)}_i, & \text {if } \mathbf {\Theta }_{i,i} = 0 \\ \frac{y_i + \frac{\gamma }{2}\widetilde{v}^{(k)}_i}{1 + \frac{\gamma }{2}}, & \text {if } \mathbf {\Theta }_{i,i} = 1 \end{array}\right. } $$

where \(\widehat{\xi }_i^{(k)}\), \(y_i\), \(\widetilde{v}^{(k)}_i\) are the \(i^\textrm{th}\) components of the vectors \(\widehat{\boldsymbol{\xi }}^{(k)}\), \(\textbf{y}\), \(\widetilde{\textbf{v}}^{(k)}\), respectively.
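For a quick check that the componentwise form matches the matrix-inverse solution (28), the following sketch uses illustrative dimensions, a random 0/1 diagonal \(\mathbf {\Theta }\), and an arbitrary \(\gamma \) (all assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 12, 0.5

theta_diag = rng.integers(0, 2, size=d).astype(float)  # 0/1 mask on the diagonal
Theta = np.diag(theta_diag)
y = rng.standard_normal(d)
v = rng.standard_normal(d)                              # stands in for v_tilde^(k)

# Matrix form (28): (Theta^T Theta + (gamma/2) I)^{-1} (Theta y + (gamma/2) v)
xi_matrix = np.linalg.solve(Theta.T @ Theta + (gamma / 2) * np.eye(d),
                            Theta @ y + (gamma / 2) * v)

# Componentwise form: keep v where the pixel is missing, weighted average otherwise.
xi_comp = np.where(theta_diag == 0,
                   v,
                   (y + (gamma / 2) * v) / (1 + gamma / 2))

assert np.allclose(xi_matrix, xi_comp)
```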

C The Examined Autoencoder Architectures

We trained two fully connected (FC) autoencoder architectures, one with 10 layers and one with 20 layers (see Figure C.1 in the Supplementary Material). We also trained a U-Net autoencoder model (see Figure C.2 in the Supplementary Material). The activation functions used for training the models were Leaky ReLU, PReLU, and Softplus. To ensure reproducibility, all experiments were run with random seed 42.
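Since the exact layer widths appear only in Figure C.1 of the Supplementary Material, the following PyTorch sketch only illustrates the general structure of such a fully connected autoencoder; the hidden width of 512 and the uniform layer layout are assumptions for illustration, not the architecture used in the paper:

```python
import torch.nn as nn

def fc_autoencoder(num_layers: int = 10, in_dim: int = 64 * 64 * 3,
                   hidden_dim: int = 512) -> nn.Sequential:
    """Generic fully connected autoencoder sketch (illustrative widths only)."""
    dims = [in_dim] + [hidden_dim] * (num_layers - 1) + [in_dim]
    layers = []
    for i in range(num_layers):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < num_layers - 1:                 # no activation after the last layer
            layers.append(nn.LeakyReLU())      # or nn.PReLU() / nn.Softplus()
    return nn.Sequential(*layers)

model = fc_autoencoder(num_layers=10)  # 10 linear layers with LeakyReLU in between
```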

1.1 C.1 Perfect Fitting Regime

In the experiments that reach the perfect fitting regime, the FC models were trained on 600 images from Tiny ImageNet (at \(64 \times 64 \times 3\) pixel size) down to an MSE train loss of \(10^{-8}\), which can be considered numerically perfect fitting. During training, intermediate models at higher train loss values were saved and used later for evaluating the recovery at lower overfitting levels (see, e.g., Figure F.2). For each of the 10-layer and 20-layer FC architectures, we trained two versions, one with Leaky ReLU and one with PReLU activations.

The U-Net model was trained on 50 images from the SVHN dataset (at \(32 \times 32 \times 3\) pixel size) also to an MSE train loss of \(10^{{-8}}\), while saving intermediate models at higher train loss values. We examined U-Net architectures for three different activation functions: Leaky ReLU, PReLU, and Softplus.
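A minimal sketch of this train-to-threshold procedure with intermediate checkpointing is given below; the Adam optimizer, learning rate, full-batch updates, and checkpoint thresholds are illustrative assumptions, not the paper's exact training configuration:

```python
import torch

def train_to_threshold(model, images, target_loss=1e-8,
                       checkpoint_losses=(1e-2, 1e-4, 1e-6)):
    """Fit an autoencoder to its training images, saving intermediate models."""
    x = images.flatten(start_dim=1)      # for the FC models; U-Net keeps images as-is
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    remaining = sorted(checkpoint_losses, reverse=True)
    loss = torch.tensor(float("inf"))
    while loss.item() > target_loss:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), x)
        loss.backward()
        optimizer.step()
        if remaining and loss.item() <= remaining[0]:    # save an intermediate model
            torch.save(model.state_dict(), f"ae_loss_{remaining.pop(0):.0e}.pt")
    torch.save(model.state_dict(), "ae_perfect_fit.pt")
```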

1.2 C.2 Moderate Overfitting Regime

In the moderate overfitting regime, we trained a 20-layer FC model on a larger subset of 25,000 images from Tiny ImageNet (at \(64 \times 64 \times 3\) pixel size) with Leaky ReLU activations to a loss of \(10^{{-4}}\). This achieves moderate overfitting, yet not perfect fitting of the training data.

The U-Net model was trained on 1000 images from the CIFAR-10 dataset (at \(32 \times 32 \times 3\) pixel size) to an MSE train loss of \(10^{{-4}}\). We examined U-Net architectures for three different activation functions: Leaky ReLU, PReLU, and Softplus.

D The Proposed Method: Additional Implementation Details

1.1 D.1 Stopping Criterion of the Proposed Method

The stopping criterion for the ADMM via Algorithm 1 (which solves equation (6)) is a predefined number of iterations. We set this to 40 iterations. Each alternating minimization iteration in our overall algorithm includes one ADMM procedure.

The stopping criterion for the entire recovery algorithm via alternating minimization, (6)-(7), is that the MSE between successive estimates \(\widehat{\textbf{x}}^{(t)}\) must be below a threshold for 3 consecutive iterations; we set this MSE threshold to \(10^{-9}\).
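A sketch of this stopping rule follows, assuming a user-supplied `recovery_step` function that performs one alternating-minimization iteration of (6)-(7), including its inner ADMM; the function and variable names are illustrative:

```python
import numpy as np

def run_alternating_minimization(recovery_step, x_init,
                                 mse_threshold=1e-9, patience=3, max_iters=500):
    """Iterate until the MSE between successive estimates stays below the
    threshold for `patience` consecutive iterations."""
    x_prev = x_init
    consecutive = 0
    for _ in range(max_iters):
        x_next = recovery_step(x_prev)   # one iteration of (6)-(7), incl. the ADMM
        mse = np.mean((x_next - x_prev) ** 2)
        consecutive = consecutive + 1 if mse < mse_threshold else 0
        if consecutive >= patience:
            break
        x_prev = x_next
    return x_next
```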

1.2 D.2 \(\gamma \) Values for the Proposed Method

The \(\gamma \) value of the ADMM (Algorithm 1) was set as follows: \(\gamma =0.5\) for the 10-layer FC autoencoder with Leaky ReLU activations; \(\gamma =0.1\) for the 10-layer FC autoencoder with PReLU activations and for the 20-layer FC autoencoder; \(\gamma =1\) for the U-Net architecture.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Abitbul, K., Dar, Y. (2024). How Much Training Data Is Memorized in Overparameterized Autoencoders? An Inverse Problem Perspective on Memorization Evaluation. In: Bifet, A., Davis, J., Krilavičius, T., Kull, M., Ntoutsi, E., Žliobaitė, I. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2024. Lecture Notes in Computer Science, vol 14942. Springer, Cham. https://doi.org/10.1007/978-3-031-70344-7_19

  • DOI: https://doi.org/10.1007/978-3-031-70344-7_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70343-0

  • Online ISBN: 978-3-031-70344-7

  • eBook Packages: Computer Science, Computer Science (R0)
