
Effect of Geometric Complexity on Intuitive Model Selection

  • Conference paper in Machine Learning, Optimization, and Data Science (LOD 2021)

Abstract

Occam’s razor is the principle stating that, all else being equal, simpler explanations for a set of observations are to be preferred to more complex ones. This idea can be made precise in the context of statistical inference, where the same quantitative notion of complexity of a statistical model emerges naturally from different approaches based on Bayesian model selection and information theory. The broad applicability of this mathematical formulation suggests a normative model of decision-making under uncertainty: complex explanations should be penalized according to this common measure of complexity. However, little is known about if and how humans intuitively quantify the relative complexity of competing interpretations of noisy data. Here we measure the sensitivity of naive human subjects to statistical model complexity. Our data show that human subjects bias their decisions in favor of simple explanations based not only on the dimensionality of the alternatives (number of model parameters), but also on finer-grained aspects of their geometry. In particular, as predicted by the theory, models intuitively judged as more complex are not only those with more parameters, but also those with larger volume and prominent curvature or boundaries. Our results imply that principled notions of statistical model complexity have direct quantitative relevance to human decision-making.


Notes

  1.

    In a similar way, one could define a two-dimensional model represented by a 2D area on the screen. This approach would be useful to provide an additional evaluation point for the dependence of the simplicity bias on model dimensionality. However, unlike a 0D or 1D model, a 2D model in a 2D data space will always suffer from boundary effects for data falling anywhere outside the model manifold. Therefore, because one primary goal of this study was to disentangle the distinct contributions of the models’ different geometrical features to the simplicity bias, we only use 1D models.

References

  1. Abramowitz, M., Stegun, I.A.: Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables. Dover, New York (1972)

  2. Amari, S.I., Nagaoka, H.: Methods of Information Geometry. Translations of Mathematical Monographs. American Mathematical Society (2000)

  3. Balasubramanian, V.: Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comput. 9(2), 349–368 (1997). https://doi.org/10.1162/neco.1997.9.2.349

  4. Betancourt, M.: A conceptual introduction to Hamiltonian Monte Carlo (2018). https://arxiv.org/abs/1701.02434

  5. Bialek, W., Nemenman, I., Tishby, N.: Predictability, complexity and learning. Neural Comput. 13, 2409–2463 (2001). https://doi.org/10.1162/089976601753195969

  6. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006). https://doi.org/10.1007/978-1-4615-7566-5

  7. Efron, B., Hinkley, D.V.: Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher information. Biometrika 65(3), 457–483 (1978). https://doi.org/10.1093/biomet/65.3.457

  8. Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis, 3rd edn. CRC Press, Boca Raton (2014)

  9. Gelman, A., Hwang, J., Vehtari, A.: Understanding predictive information criteria for Bayesian models. Stat. Comput. 24(6), 997–1016 (2013). https://doi.org/10.1007/s11222-013-9416-2

  10. Genewein, T., Braun, D.A.: Occam's razor in sensorimotor learning. Proc. Roy. Soc. B Biol. Sci. 281(1783), 20132952 (2014). https://doi.org/10.1098/rspb.2013.2952

  11. Grünwald, P.D.: The Minimum Description Length Principle. MIT Press, Cambridge (2007)

  12. Gull, S.F.: Bayesian inductive inference and maximum entropy. In: Erickson, G.J., Smith, C.R. (eds.) Maximum-Entropy and Bayesian Methods in Science and Engineering, pp. 53–74. Springer, Netherlands (1988). https://doi.org/10.1007/978-94-009-3049-0_4

  13. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning, 2nd edn. Springer, Heidelberg (2009)

  14. Hoffman, M.D., Gelman, A.: The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15(47), 1593–1623 (2014). http://jmlr.org/papers/v15/hoffman14a.html

  15. Jaynes, E.T.: Probability Theory: The Logic of Science. Cambridge University Press, Cambridge (2003)

  16. Jeffreys, H.: Theory of Probability. Clarendon Press, Oxford (1939)

  17. Johnson, S., Jin, A., Keil, F.: Simplicity and goodness-of-fit in explanation: the case of intuitive curve-fitting. In: Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 36, no. 36 (2014)

  18. Kruschke, J.K.: Doing Bayesian Data Analysis, 2nd edn. Academic Press, Cambridge (2015)

  19. Kumar, R., Carroll, C., Hartikainen, A., Martin, O.: ArviZ: a unified library for exploratory analysis of Bayesian models in Python. J. Open Source Softw. 4(33), 1143 (2019). https://doi.org/10.21105/joss.01143

  20. MacKay, D.J.C.: Bayesian interpolation. Neural Comput. 4(3), 415–447 (1992). https://doi.org/10.1162/neco.1992.4.3.415

  21. McElreath, R.: Statistical Rethinking. CRC Press, Boca Raton (2016)

  22. Piasini, E., Balasubramanian, V., Gold, J.I.: Preregistration document (2016). https://doi.org/10.17605/OSF.IO/2X9H6

  23. Piasini, E., Balasubramanian, V., Gold, J.I.: Preregistration document addendum. https://doi.org/10.17605/OSF.IO/5HDQZ

  24. Piasini, E., Gold, J.I., Balasubramanian, V.: Information geometry of Bayesian model selection (2021, unpublished)

  25. Rissanen, J.: Stochastic complexity and modeling. Ann. Stat. 14(3), 1080–1100 (1986). https://www.jstor.org/stable/3035559

  26. Salvatier, J., Wiecki, T.V., Fonnesbeck, C.: Probabilistic programming in Python using PyMC3. PeerJ Comput. Sci. 2, e55 (2016). https://doi.org/10.7717/peerj-cs.55

  27. Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., Bürkner, P.C.: Rank-normalization, folding, and localization: an improved \(\hat{R}\) for assessing convergence of MCMC. Bayesian Analysis (2020). https://doi.org/10.1214/20-ba1221


Acknowledgements

We thank Chris Pizzica for help with setting up the web-based version of the experiments, and for managing subject recruitment. We acknowledge support or partial support from R01 NS113241 (EP) and R01 EB026945 (VB and JG).

Author information

Corresponding author

Correspondence to Eugenio Piasini.

Appendices

A Derivation of the Boundary Term in the Fisher Information Approximation

Here we generalize the derivation of the Fisher Information Approximation given in [3] to the case where the maximum likelihood solution for a model lies on the boundary of the parameter space. Apart from the more general assumptions, the following derivation closely follows the original one, with some minor notational changes.

1.1 A.1 Set-up and Hypotheses

The problem we consider here is that of selecting between two models (say \(\mathcal {M}_1\) and \(\mathcal {M}_2\)) after observing empirical data \(X=\left\{ x_i\right\} _{i=1}^N\). Here N is the sample size, and \(\mathcal {M}_1\) is assumed to have d parameters, collectively indexed as \(\vartheta \) and taking values in a compact domain \(\varTheta \). As a prior over \(\vartheta \) we take Jeffreys' prior:

$$\begin{aligned} w(\vartheta ) = \frac{\sqrt{\det g(\vartheta )}}{\int \mathrm {d}\!\,^d\vartheta \sqrt{\det g(\vartheta )}} \end{aligned}$$
(13)

where g is the (expected) Fisher Information of the model \(\mathcal {M}_1\):

$$\begin{aligned} g_{\mu \nu }(\vartheta ) = \mathbb {E}\left[ -\frac{\partial ^2\ln p(x|\vartheta )}{\partial \vartheta ^\mu \partial \vartheta ^\nu }\right] _{\vartheta } \end{aligned}$$
(14)

The Bayesian posterior

$$\begin{aligned} \mathbb {P}(\mathcal {M}_1|X) = \frac{\mathbb {P}(\mathcal {M}_1)}{\mathbb {P}(X)}\int _\varTheta \mathrm {d}\!\,^d\vartheta \, w(\vartheta )\mathbb {P}(X|\vartheta ) \end{aligned}$$
(15)

then becomes, after assuming a flat prior over models and dropping irrelevant terms,

$$\begin{aligned} \mathbb {P}(\mathcal {M}_1|X) = \frac{\int _\varTheta \mathrm {d}\!\,^d\vartheta \sqrt{\det g}\exp [-N(-\frac{1}{N}\ln \mathbb {P}(X|\vartheta ))]}{\int \mathrm {d}\!\,^d\vartheta \sqrt{\det g}} \end{aligned}$$
(16)

Just as in [3], we now make a number of regularity assumptions: 1. \(\ln \mathbb {P}(X|\vartheta )\) is smooth; 2. there is a unique global minimum \(\hat{\vartheta }\) for \(\ln \mathbb {P}(X|\vartheta )\); 3. \(g_{\mu \nu }(\vartheta )\) is smooth; 4. \(g_{\mu \nu }(\hat{\vartheta })\) is positive definite; 5. \(\varTheta \subset \mathbb {R}^d\) is compact; and 6. the values of the local minima of \(\ln \mathbb {P}(X|\vartheta )\) are bounded away from the global minimum by some \(\epsilon >0\). Importantly, unlike in [3], we don’t assume that \(\hat{\vartheta }\) is in the interior of \(\varTheta \).

The Shape of \(\varTheta \). Because we are specifically interested in understanding what happens at a boundary of the parameter space, we will add a further assumption that, while not very restrictive in spirit, will allow us to derive a particularly interpretable result. In particular, we will assume that \(\varTheta \) is specified by a single linear constraint of the form

$$\begin{aligned} D_\mu \vartheta ^\mu + d \ge 0 \end{aligned}$$
(17)

Without loss of generality, we’ll also take the constraint to be expressed in Hessian normal form—namely, \(\Vert D_\mu \Vert =1\).

For clarity, note this assumption on the shape of \(\varTheta \) is only used from Subsect. A.3 onward.
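As a concrete numerical illustration of Eqs. 13 and 14, the sketch below estimates the expected Fisher information and the resulting Jeffreys' prior for a toy one-dimensional model: a horizontal segment in the 2D data plane observed with isotropic Gaussian noise. The model and all parameter values are made up purely for illustration and are not the stimuli used in the experiments.

```python
import numpy as np

# Toy 1D model (illustration only): manifold f(theta) = (theta, 0) for
# theta in [0, L_seg], observed in the 2D plane with isotropic Gaussian noise.
L_seg, sigma = 2.0, 0.5

def log_lik(x, theta):
    """log p(x | theta) for a single 2D observation x."""
    mu = np.array([theta, 0.0])
    return -np.sum((x - mu) ** 2) / (2 * sigma**2) - np.log(2 * np.pi * sigma**2)

def expected_fisher_info(theta, n_mc=2000, eps=1e-3, seed=0):
    """Eq. 14 by Monte Carlo over x ~ p(x|theta), with a finite-difference
    estimate of the second derivative of the log likelihood in theta."""
    rng = np.random.default_rng(seed)
    x = rng.normal(loc=[theta, 0.0], scale=sigma, size=(n_mc, 2))
    d2 = [(log_lik(xi, theta + eps) - 2 * log_lik(xi, theta)
           + log_lik(xi, theta - eps)) / eps**2 for xi in x]
    return -np.mean(d2)

# For this model g(theta) = 1/sigma^2 exactly, so the estimates should be flat.
thetas = np.linspace(0.0, L_seg, 5)
g = np.array([expected_fisher_info(t) for t in thetas])
print("g(theta) estimates:", g.round(3), "  exact:", 1 / sigma**2)

# Jeffreys' prior (Eq. 13): sqrt(det g), normalized over the parameter domain
# by the trapezoid rule (the prior is uniform for this toy model).
dtheta = thetas[1] - thetas[0]
Z = dtheta * (np.sqrt(g).sum() - 0.5 * (np.sqrt(g)[0] + np.sqrt(g)[-1]))
w = np.sqrt(g) / Z
print("Jeffreys' prior:", w.round(3), "  exact:", 1 / L_seg)
```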

1.2 A.2 Preliminaries

We will now proceed to set up a low-temperature expansion of Eq. 16 around the saddle point \(\hat{\vartheta }\). We start by rewriting the numerator in Eq. 16 as

$$\begin{aligned} \int _\varTheta \mathrm {d}\!\,^d\vartheta \exp \left[ -N\left( -\frac{1}{2N}\ln \det g - \frac{1}{N}\ln \mathbb {P}(X|\vartheta )\right) \right] \end{aligned}$$
(18)

The idea of the Fisher Information Approximation is to expand the integrand in Eq. 18 in powers of N around the maximum likelihood point \(\hat{\vartheta }\). To this end, let's define three useful objects: the function appearing in the exponent of Eq. 18,

$$\begin{aligned} \psi (\vartheta ) := -\frac{1}{N}\ln \mathbb {P}(X|\vartheta ) - \frac{1}{2N}\ln \det g(\vartheta ) \end{aligned}$$

and the coefficients of its Taylor expansion around \(\hat{\vartheta }\),

$$\begin{aligned} \tilde{I}_{\mu _1\cdots \mu _i} := \left. \frac{\partial ^i}{\partial \vartheta ^{\mu _1}\cdots \partial \vartheta ^{\mu _i}}\left[ -\frac{1}{N}\ln \mathbb {P}(X|\vartheta )\right] \right| _{\vartheta =\hat{\vartheta }}\quad ,\quad F_{\mu _1\cdots \mu _i} := \left. \frac{\partial ^i}{\partial \vartheta ^{\mu _1}\cdots \partial \vartheta ^{\mu _i}}\ln \det g(\vartheta )\right| _{\vartheta =\hat{\vartheta }} \end{aligned}$$

For \(i=0\) these reduce to \(\tilde{I}=-\frac{1}{N}\ln \mathbb {P}(X|\hat{\vartheta })\) and \(F=\ln \det g(\hat{\vartheta })\), so that \(\psi (\vartheta )\) can be written as a power series in \((\vartheta -\hat{\vartheta })\) with coefficients \(\tilde{I}_{\mu _1\cdots \mu _i}-\frac{1}{2N}F_{\mu _1\cdots \mu _i}\).

It is also useful to center the integration variables by introducing

$$\begin{aligned} \phi := \sqrt{N}(\vartheta -\hat{\vartheta })\end{aligned}$$
(19)
$$\begin{aligned} \mathrm {d}\!\,^d\phi = N^{d/2}\mathrm {d}\!\,^d\vartheta \end{aligned}$$
(20)

so that

$$\begin{aligned} \psi (\vartheta ) = \sum _{i=0}^\infty \frac{1}{i!}N^{-i/2}\left( \tilde{I}_{\mu _1\cdots \mu _i}-\frac{1}{2N}F_{\mu _1\cdots \mu _i}\right) \phi ^{\mu _1}\cdots \phi ^{\mu _i} \end{aligned}$$
(21)

and Eq. 18 becomes

$$\begin{aligned} \begin{aligned} \int \mathrm {d}\!\,^d\vartheta&\exp [-N\psi ] = N^{-d/2}\int \mathrm {d}\!\,^d\phi \exp \left[ -N\sum _{i=0}^\infty \frac{1}{i!}N^{-i/2}\left( \tilde{I}_{\mu _1\cdots \mu _i}-\frac{1}{2N}F_{\mu _1\cdots \mu _i}\right) \phi ^{\mu _1}\cdots \phi ^{\mu _i}\right] \\&= N^{-d/2}\int \mathrm {d}\!\,^d\phi {\text {exp}}\Bigg \{-N\left( -\frac{1}{N}\ln \mathbb {P}(X|\hat{\vartheta })-\frac{1}{2N}\ln \det g(\hat{\vartheta })\right) + \\&\quad -N\left[ \sum _{i=1}^\infty \frac{1}{i!}N^{-i/2}\left( \tilde{I}_{\mu _1\cdots \mu _i}-\frac{1}{2N}F_{\mu _1\cdots \mu _i}\right) \phi ^{\mu _1}\cdots \phi ^{\mu _i}\right] \Bigg \}\\&= N^{-\frac{d}{2}}\exp [-\left( -\ln \mathbb {P}(X|\hat{\vartheta })-\frac{1}{2}\ln \det g(\hat{\vartheta })\right) ]\times \\&\quad \times \int \mathrm {d}\!\,^d\phi \exp \Bigg \{-N\Bigg [\frac{1}{\sqrt{N}}\tilde{I}_\mu \phi ^\mu + \frac{1}{2N}\tilde{I}_{\mu \nu }\phi ^\mu \phi ^\nu + \\&\qquad + \frac{1}{N}\sum _{i=1}^\infty N^{-\frac{i}{2}}\left( \frac{1}{(i+2)!}\tilde{I}_{\mu _1\cdots \mu _{i+2}}\phi ^{\mu _1}\cdots \phi ^{\mu _{i+2}}-\frac{1}{2i!}F_{\mu _1\cdots \mu _i}\phi ^{\mu _1}\cdots \phi ^{\mu _i}\right) \Bigg ]\Bigg \} \end{aligned} \end{aligned}$$

Therefore,

$$\begin{aligned} \begin{aligned} \mathbb {P}(\mathcal {M}_1|X)&= N^{-\frac{d}{2}}\exp [-\left( -\ln \mathbb {P}(X|\hat{\vartheta })-\frac{1}{2}\ln \det g(\hat{\vartheta })+\ln \int _\varTheta \mathrm {d}\!\,^d\vartheta \sqrt{\det g}\right) ]\times \\&\quad \times \int _\varPhi \mathrm {d}\!\,^d\phi \exp \Bigg [-\sqrt{N}\tilde{I}_\mu \phi ^\mu - \frac{1}{2}\tilde{I}_{\mu \nu }\phi ^\mu \phi ^\nu + \\&\qquad - \sum _{i=1}^\infty N^{-\frac{i}{2}}\left( \frac{1}{(i+2)!}\tilde{I}_{\mu _1\cdots \mu _{i+2}}\phi ^{\mu _1}\cdots \phi ^{\mu _{i+2}}-\frac{1}{2i!}F_{\mu _1\cdots \mu _i}\phi ^{\mu _1}\cdots \phi ^{\mu _i}\right) \Bigg ]\\&= N^{-\frac{d}{2}}\exp [-\left( -\ln \mathbb {P}(X|\hat{\vartheta })-\frac{1}{2}\ln \det g(\hat{\vartheta })+\ln \int _\varTheta \mathrm {d}\!\,^d\vartheta \sqrt{\det g}\right) ]\cdot Q \end{aligned} \end{aligned}$$
(22)

where

$$\begin{aligned} Q = \int _\varPhi \mathrm {d}\!\,^d\phi \exp \left[ -\sqrt{N}\tilde{I}_\mu \phi ^\mu - \frac{1}{2}\tilde{I}_{\mu \nu }\phi ^\mu \phi ^\nu - G(\phi )\right] \end{aligned}$$
(23)

and

$$\begin{aligned} G(\phi ) = \sum _{i=1}^\infty N^{-\frac{i}{2}}\left( \frac{1}{(i+2)!}\tilde{I}_{\mu _1\cdots \mu _{i+2}}\phi ^{\mu _1}\cdots \phi ^{\mu _{i+2}}-\frac{1}{2i!}F_{\mu _1\cdots \mu _i}\phi ^{\mu _1}\cdots \phi ^{\mu _i}\right) \end{aligned}$$
(24)

where \(G(\phi )\) collects the terms that are suppressed by powers of N.

Our problem has now been reduced to computing Q by performing the integral in Eq. 23. This is where our assumptions come into play for the key approximation step: for the sake of simplicity, since N is large, we drop \(G(\phi )\) from the expression above, so that Q becomes a simple Gaussian integral with a linear term:

$$\begin{aligned} Q = \int _\varPhi \mathrm {d}\!\,^d\phi \exp \left[ -\sqrt{N}\tilde{I}_\mu \phi ^\mu - \frac{1}{2}\phi ^\mu \tilde{I}_{\mu \nu }\phi ^\nu \right] \end{aligned}$$
(25)

1.3 A.3 Choosing a Good System of Coordinates

Consider now the Observed Fisher Information at maximum likelihood, \(\tilde{I}_{\mu \nu }\). As long as it is not singular, we can define its inverse \(\varDelta ^{\mu \nu }=(\tilde{I}_{\mu \nu })^{-1}\). If \(\tilde{I}_{\mu \nu }\) is positive definite, then the matrix representation of \(\tilde{I}_{\mu \nu }\) will have a set of d positive eigenvalues which we will denote by \(\{\sigma _{(1)}^{-2}, \sigma _{(2)}^{-2},\ldots ,\sigma _{(d)}^{-2}\}\). The matrix representation of \(\varDelta ^{\mu \nu }\) will have eigenvalues \(\{\sigma _{(1)}^{2}, \sigma _{(2)}^{2},\ldots ,\sigma _{(d)}^{2}\}\), and will be diagonal in the same choice of coordinates as \(\tilde{I}_{\mu \nu }\). Denote by U the (orthogonal) diagonalizing matrix, i.e., U is such that

$$\begin{aligned} U\varDelta U^{\intercal } = \begin{bmatrix} \sigma ^2_{(1)}&{}0 &{}\cdots &{} 0 \\ 0 &{} \sigma ^2_{(2)} &{} &{}\vdots \\ \vdots &{} &{} \ddots &{}0\\ 0 &{} \ldots &{} 0 &{} \sigma ^2_{(d)} \end{bmatrix}\quad ,\quad U^\intercal U = U U^\intercal = \mathbb {I} \end{aligned}$$
(26)

Define also the matrix K as the product of the diagonal matrix with elements \(1/\sigma _{(k)}\) along the diagonal and U:

$$\begin{aligned} K = \begin{bmatrix} 1/\sigma _{(1)}&{}0 &{}\cdots &{} 0 \\ 0 &{} 1/\sigma _{(2)} &{} &{}\vdots \\ \vdots &{} &{} \ddots &{}0\\ 0 &{} \ldots &{} 0 &{} 1/\sigma _{(d)} \end{bmatrix} U \end{aligned}$$
(27)

Note that

$$ \det K = \left( \det \varDelta ^{\mu \nu }\right) ^{-1/2} = \sqrt{\det \tilde{I}_{\mu \nu }} $$

and that K corresponds to a sphering transformation, in the sense that

$$\begin{aligned} K\varDelta K^\intercal =\mathbb {I} \quad \text { or }\quad K^{\mu }_{\;\;\kappa }\varDelta ^{\kappa \lambda }K^\nu _{\;\;\lambda } = \delta ^{{\mu \nu }} \end{aligned}$$
(28)

and therefore, if we define the inverse

$$ P = K^{-1} $$

we have

$$\begin{aligned} P^\intercal (\tilde{I}_{\mu \nu }) P=\mathbb {I} \quad \text { or }\quad P^{\kappa }_{\;\;\mu }\tilde{I}_{\kappa \lambda }P^\lambda _{\;\;\nu } = \delta _{{\mu \nu }} \end{aligned}$$
(29)

We can now define a new set of coordinates by centering and sphering, as follows:

$$\begin{aligned} \xi ^\mu = K^\mu _{\;\;\nu } \left( \phi ^\nu + \sqrt{N}\varDelta ^{\nu \kappa }\tilde{I}_\kappa \right) \end{aligned}$$
(30)

Then,

$$\begin{aligned} \mathrm {d}\!\,^d\xi = \sqrt{\det \tilde{I}_{\mu \nu }}\mathrm {d}\!\,^d\phi \end{aligned}$$
(31)

and

$$\begin{aligned} \phi ^\mu = P^\mu _{\;\;\nu }\xi ^\nu - \sqrt{N}\varDelta ^{\mu \nu }\tilde{I}_\nu \end{aligned}$$
(32)

In this new set of coordinates,

$$\begin{aligned} -\sqrt{N}\tilde{I}_\nu \phi ^\nu -\frac{1}{2}\phi ^\mu \tilde{I}_{\mu \nu }\phi ^\nu&= -\left( \sqrt{N}\tilde{I}_\nu +\frac{1}{2}\phi ^\mu \tilde{I}_{\mu \nu }\right) \phi ^\nu \nonumber \\&= -\left( \sqrt{N}\tilde{I}_\nu + \frac{1}{2}P^\mu _{\;\;\kappa }\xi ^\kappa \tilde{I}_{\mu \nu } - \frac{\sqrt{N}}{2}\varDelta ^{\mu \kappa }\tilde{I}_\kappa \tilde{I}_{\mu \nu }\right) \phi ^\nu \nonumber \\&= -\sqrt{N}\tilde{I}_\nu P^\nu _{\;\;\lambda }\xi ^\lambda + N\varDelta ^{\nu \lambda }\tilde{I}_\lambda \tilde{I}_\nu - \frac{1}{2}P^\mu _{\;\;\kappa }\xi ^\kappa \tilde{I}_{\mu \nu }P^\nu _{\;\;\lambda }\xi ^\lambda + \frac{\sqrt{N}}{2}P^\mu _{\;\;\kappa }\xi ^\kappa \tilde{I}_{{\mu \nu }}\varDelta ^{\nu \lambda }\tilde{I}_\lambda \nonumber \\&\quad + \frac{\sqrt{N}}{2}\varDelta ^{\mu \kappa }\tilde{I}_\kappa \tilde{I}_{\mu \nu }P^\nu _{\;\;\lambda }\xi ^\lambda - \frac{N}{2}\varDelta ^{\mu \kappa }\tilde{I}_\kappa \tilde{I}_{\mu \nu }\varDelta ^{\nu \lambda }\tilde{I}_\lambda \nonumber \\&= \frac{N}{2}\tilde{I}_\nu \varDelta ^{\nu \lambda }\tilde{I}_\lambda - \frac{1}{2}\xi ^\kappa \delta _{\kappa \lambda }\xi ^\lambda \end{aligned}$$
(33)

where we have used Eq. 29 as well as the fact that \(\varDelta ^{\mu \nu }=\varDelta ^{\nu \mu }\) and that \(\varDelta ^{\mu \kappa }\tilde{I}_{\kappa \nu }=\delta ^\mu _{\;\;\nu }\) by definition.

Therefore, putting Eq. 31 and Eq. 33 together, Eq. 25 becomes

$$\begin{aligned} Q = \frac{\exp [\frac{N}{2}\tilde{I}_\mu \varDelta ^{\mu \nu }\tilde{I}_\nu ]}{\sqrt{\det \tilde{I}_{\mu \nu }}}\int _\varXi \mathrm {d}\!\,^d\xi \exp [-\frac{1}{2}\xi _\mu \delta ^{{\mu \nu }}\xi _\nu ] \end{aligned}$$
(34)

The problem is reduced to a (truncated) spherical gaussian integral, where the domain of integration \(\varXi \) will depend on the original domain \(\varTheta \) but also on \(\tilde{I}_\mu \), \(\tilde{I}_{\mu \nu }\) and \(\hat{\vartheta }\). To complete the calculation, we now need to make this dependence explicit.
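The linear algebra in this subsection is straightforward to verify numerically. The following sketch, with an arbitrary made-up matrix standing in for \(\tilde{I}_{\mu \nu }\), constructs U, K and P as in Eqs. 26, 27 and 29 and checks the sphering identities:

```python
import numpy as np

rng = np.random.default_rng(1)

# An arbitrary (made-up) symmetric positive-definite matrix standing in for the
# observed Fisher information at maximum likelihood, I~_{mu nu}.
A = rng.normal(size=(3, 3))
I_obs = A @ A.T + 3 * np.eye(3)

Delta = np.linalg.inv(I_obs)                  # Delta^{mu nu} = (I~_{mu nu})^{-1}

# Eigendecomposition of Delta: U Delta U^T = diag(sigma_(k)^2)   (Eq. 26)
sigma2, V = np.linalg.eigh(Delta)
U = V.T
assert np.allclose(U @ Delta @ U.T, np.diag(sigma2))

# K = diag(1/sigma_(k)) U  (Eq. 27), and its inverse P
K = np.diag(1 / np.sqrt(sigma2)) @ U
P = np.linalg.inv(K)

# Sphering identities: K Delta K^T = I (Eq. 28), P^T I~ P = I (Eq. 29),
# and det K = sqrt(det I~) up to a sign from the orthogonal factor U.
assert np.allclose(K @ Delta @ K.T, np.eye(3))
assert np.allclose(P.T @ I_obs @ P, np.eye(3))
assert np.isclose(abs(np.linalg.det(K)), np.sqrt(np.linalg.det(I_obs)))
print("sphering identities verified")
```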

1.4 A.4 Determining the Domain of Integration

We start by combining Eq. 19 and Eq. 32 to yield

$$\begin{aligned} \vartheta ^\mu =\frac{1}{\sqrt{N}}P^\mu _{\;\;\nu }\xi ^\nu - \varDelta ^{\mu \nu }\tilde{I}_\nu +\hat{\vartheta }^\mu \end{aligned}$$
(35)

By substituting Eq. 35 into Eq. 17 we get

$$\begin{aligned} D_\mu \left( \frac{P^\mu _{\;\;\nu }\xi ^\nu }{\sqrt{N}} - \varDelta ^{\mu \nu }\tilde{I}_\nu +\hat{\vartheta }^\mu \right) + d \ge 0 \end{aligned}$$

which we can rewrite as

$$\begin{aligned} \tilde{D}_\mu \xi ^\mu + \tilde{d}\ge 0 \end{aligned}$$
(36)

with

$$\begin{aligned} \tilde{D}_\mu := \frac{1}{\sqrt{N}} D_\nu P^{\nu }_{\;\;\mu } \end{aligned}$$
(37)

and

$$\begin{aligned} \begin{aligned} \tilde{d}&:= d + D_\mu \hat{\vartheta }^\mu - D_\mu \varDelta ^{\mu \nu }\tilde{I}_\nu \\&= d + D_\mu \hat{\vartheta }^\mu - \langle D_\mu ,\tilde{I}_\mu \rangle _\varDelta \end{aligned} \end{aligned}$$
(38)

where by \(\langle \cdot ,\cdot \rangle _\varDelta \) we mean the inner product in the inverse observed Fisher information metric. Now, note that whenever \(\tilde{I}_\mu \) is not zero it will be parallel to \(D_\mu \). Indeed, by construction of the maximum likelihood point \(\hat{\vartheta }\), the gradient of the log likelihood can only be orthogonal to the boundary at \(\hat{\vartheta }\), and pointing towards the outside of the domain; therefore \(\tilde{I}_\mu \), which is defined as minus the gradient, will point inward. At the same time, \(D_\mu \) will also always point toward the interior of the domain because of the form of the constraint we have chosen in Eq. 17. Because by assumption \(\Vert D_\mu \Vert =1\), we have that

$$ \tilde{I}_\mu = \Vert \tilde{I}_\nu \Vert D_\mu $$

and

$$ \langle D_\mu ,\tilde{I}_\mu \rangle _\varDelta = \Vert D_\nu \Vert _\varDelta \cdot \Vert \tilde{I}_\nu \Vert _\varDelta $$

so that

$$\begin{aligned} \tilde{d}= d + D_\mu \hat{\vartheta }^\mu - \Vert D_\mu \Vert _\varDelta \cdot \Vert \tilde{I}_\mu \Vert _\varDelta \end{aligned}$$
(39)

Now, the signed distance of the boundary to the origin in \(\xi \)-space is

$$ l = -\frac{\tilde{d}}{\Vert \tilde{D}_\mu \Vert } $$

where the sign is taken such that l is negative when the origin is included in the integration domain. But noting that

$$ K^\mu _{\;\;\kappa }\varDelta ^{\kappa \lambda }K^\nu _{\;\;\lambda }=\delta ^{{\mu \nu }} \quad \Rightarrow \quad \varDelta ^{\mu \nu }= P^\mu _{\;\;\kappa }\delta ^{\kappa \lambda }P^\nu _{\;\;\lambda } $$

we have

$$ \begin{aligned} \Vert \tilde{D}_\mu \Vert&= \sqrt{\tilde{D}_\mu \delta ^{{\mu \nu }}\tilde{D}_\nu } = \sqrt{\frac{1}{N}D_\kappa \left( P^\kappa _{\;\;\mu }\delta ^{\mu \nu }P^\lambda _{\;\;\nu }\right) D_\lambda }\\&=\sqrt{\frac{1}{N}D_\kappa \varDelta ^{\kappa \lambda }D_\lambda } =\frac{\Vert D_\mu \Vert _\varDelta }{\sqrt{N}} \end{aligned} $$

and therefore

$$\begin{aligned} l = -\sqrt{N}\frac{\tilde{d}}{\Vert D_\mu \Vert _\varDelta } \end{aligned}$$
(40)

Finally, by plugging Eq. 39 into Eq. 40 we obtain

$$\begin{aligned} \begin{aligned} l&= -\sqrt{N}\left[ \frac{d+D_\mu \hat{\vartheta }^\mu }{\Vert D_\mu \Vert _\varDelta } - \Vert \tilde{I}_\mu \Vert _\varDelta \right] \\&=:\sqrt{2}\left( s-m\right) \end{aligned} \end{aligned}$$
(41)

where m and s are defined for convenience like so:

$$\begin{aligned} m:= \sqrt{\frac{N}{2}}\frac{d+D_\mu \hat{\vartheta }^\mu }{\Vert D_\mu \Vert _\varDelta }\quad (\ge 0)\end{aligned}$$
(42)
$$\begin{aligned} s:= \sqrt{\frac{N}{2}}\Vert \tilde{I}_\mu \Vert _\varDelta \quad (\ge 0) \end{aligned}$$
(43)

We note that m is a rescaled version of the margin defined by the constraint on the parameters (and therefore is never negative by assumption), and s is a rescaled version of the norm of the gradient of the log likelihood in the inverse observed Fisher metric (and therefore is nonnegative by construction).
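To make these quantities concrete, the following sketch computes \(\tilde{d}\), l, m and s from Eqs. 39-43 for made-up values of the constraint, the maximum likelihood point and the observed Fisher information; all numbers are arbitrary and purely illustrative.

```python
import numpy as np

N = 10

# Made-up ingredients (purely illustrative): a constraint D.theta + d0 >= 0 in
# Hessian normal form, a maximum likelihood point on the boundary, a score
# I~_mu parallel to D (pointing inward), and an observed Fisher information.
D = np.array([1.0, 0.0])                     # ||D|| = 1
d0 = 0.0
theta_hat = np.array([0.0, 0.7])             # D.theta_hat + d0 = 0: on the boundary
I_obs = np.array([[4.0, 1.0], [1.0, 3.0]])   # I~_{mu nu}
I_grad = 0.3 * D                             # I~_mu (would be zero at an interior ML point)
Delta = np.linalg.inv(I_obs)

def norm_Delta(v):
    """||v||_Delta, the norm in the inverse observed Fisher information metric."""
    return np.sqrt(v @ Delta @ v)

# Rescaled margin and score norm (Eqs. 42-43) and the signed boundary distance
# in xi-space, computed both from Eq. 40 and from Eq. 41 as a cross-check.
m = np.sqrt(N / 2) * (d0 + D @ theta_hat) / norm_Delta(D)
s = np.sqrt(N / 2) * norm_Delta(I_grad)
d_tilde = d0 + D @ theta_hat - norm_Delta(D) * norm_Delta(I_grad)   # Eq. 39
l_eq40 = -np.sqrt(N) * d_tilde / norm_Delta(D)                      # Eq. 40
l_eq41 = np.sqrt(2) * (s - m)                                       # Eq. 41
print(f"m = {m:.3f}, s = {s:.3f}, l = {l_eq40:.3f} (= {l_eq41:.3f})")
```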

1.5 A.5 Computing the Penalty

We can now perform a final change of variables in the integral in Eq. 34. We rotate our coordinates to align them to the boundary, so that

$$ \tilde{D}_\mu = (\Vert \tilde{D}_\mu \Vert ,0,0,\ldots ,0) $$

Note that we can always do this as our integrand is invariant under rotation. In this coordinate system, Eq. 34 factorizes:

$$\begin{aligned} \begin{aligned} Q&= \frac{\exp [\frac{N}{2}\tilde{I}_\mu \varDelta ^{\mu \nu }\tilde{I}_\nu ]}{\sqrt{\det \tilde{I}_{\mu \nu }}}\int _{\mathbb {R}^{d-1}}\mathrm {d}\!\,^{d-1}\xi \exp [-\frac{\xi _\mu \delta ^{{\mu \nu }}\xi _\nu }{2}] \int _{l}^\infty \mathrm {d}\!\,\zeta \exp [-\frac{\zeta ^2}{2}]\\&=\sqrt{\frac{(2\pi )^d}{\det \tilde{I}_{\mu \nu }}}\exp [\frac{N}{2}\Vert \tilde{I}\Vert _\varDelta ^2]\frac{1}{\sqrt{\pi }}\int _{l}^\infty \frac{\mathrm {d}\!\,\zeta }{\sqrt{2}} \exp [-\frac{\zeta ^2}{2}]\\&=\sqrt{\frac{(2\pi )^d}{\det \tilde{I}_{\mu \nu }}}\exp (s^2)\frac{1}{\sqrt{\pi }}\int _{l/\sqrt{2}}^\infty \mathrm {d}\!\,\zeta \exp [-\zeta ^2]\\&=\sqrt{\frac{(2\pi )^d}{\det \tilde{I}_{\mu \nu }}}\exp (s^2)\frac{\mathrm {erfc}(s-m)}{2} \end{aligned} \end{aligned}$$
(44)

where \(\mathrm {erfc}(\cdot )\) is the complementary error function ([1], Section 7.1.2).

Finally, plugging Eq. 44 into Eq. 22 and taking the log, we obtain the extended FIA:

$$\begin{aligned} -\ln \mathbb {P}(\mathcal {M}_1|X) \simeq -\ln \mathbb {P}(X|\hat{\vartheta })+\frac{d}{2}\ln \frac{N}{2\pi } + \ln \int _\varTheta \mathrm {d}\!\,^d\vartheta \sqrt{\det g} + \frac{1}{2}\ln \left[ \frac{\det \tilde{I}_{\mu \nu }}{\det g_{\mu \nu }}\right] + S \end{aligned}$$
(45)

where

$$\begin{aligned} S:= \ln (2) - \ln \left[ \exp (s^2)\mathrm {erfc}(s-m)\right] \end{aligned}$$
(46)

can be interpreted as a penalty arising from the presence of the boundary in parameter space.
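Numerically, Eq. 46 is conveniently evaluated with the scaled complementary error function \(\mathrm {erfcx}(x)=e^{x^2}\mathrm {erfc}(x)\), which avoids overflow of \(\exp (s^2)\) when s is large. A minimal sketch using SciPy, exploiting the fact (discussed below) that at most one of m and s is nonzero:

```python
import numpy as np
from scipy.special import erfc, erfcx

def boundary_penalty(m, s):
    """Boundary penalty S of Eq. 46, S = ln 2 - ln[exp(s^2) erfc(s - m)].

    By construction at most one of m and s is nonzero (see Subsect. A.6), so
    the two branches below cover all cases while remaining numerically stable.
    """
    if s == 0.0:                              # ML point in the interior (zero score)
        return np.log(2.0) - np.log(erfc(-m))
    return np.log(2.0) - np.log(erfcx(s))     # ML point on the boundary (m = 0)

# Limiting cases discussed in Subsect. A.6:
print(boundary_penalty(m=5.0, s=0.0))   # deep in the interior: ~ 0
print(boundary_penalty(m=0.0, s=0.0))   # on the boundary, zero score: ln 2
print(boundary_penalty(m=0.0, s=4.0))   # large s: ~ ln 2 + ln(s sqrt(pi))
```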

1.6 A.6 Interpreting the Penalty

We will now take a closer look at Eq. 46. To do this, one key observation we will use is that, by construction, at most one of m and s is ever nonzero. This is because in the interior of the manifold, \(m>0\) by definition, but \(s=0\) because the gradient of the likelihood is zero at \(\hat{\vartheta }\); and on the boundary, \(m=0\) by definition, and s can be either zero or positive.

Interior of the Manifold. When \(\hat{\vartheta }\) is in the interior of the parameter space \(\varTheta \), then \(\tilde{I}_\mu =0\Rightarrow s=0\) and Eq. 46 simplifies to

$$\begin{aligned} S = \ln (2) -\ln (\mathrm {erfc}(-m)) \end{aligned}$$
(47)

but since N is large we have \(m\gg 0\), \(\mathrm {erfc}(-m)\rightarrow 2\) and \(S\rightarrow 0\), so our result passes the first sanity check: we recover the expression in [3].

Boundary of the Manifold. When \(\hat{\vartheta }\) is on the boundary of \(\varTheta \), \(m=0\) and \(s\ge 0\). Equation 46 becomes

$$\begin{aligned} S=\ln (2) - \ln \left[ \exp (s^2)\mathrm {erfc}(s)\right] = \ln (2) - \ln (w(is)) \end{aligned}$$
(48)

where w is the Faddeeva function ([1], Section 7.1.3):

$$\begin{aligned} w(z) = e^{-z^2}\mathrm {erfc}(-iz) \end{aligned}$$

This function is tabulated and can be computed efficiently. However, it is interesting to analyze its limiting behavior.

As a consistency check, when s is small at fixed N we have, to first order:

$$\begin{aligned} \begin{aligned} S&\simeq \ln (2) - \ln (1-\frac{2s}{\sqrt{\pi }})\\&\simeq \ln (2) + \frac{2s}{\sqrt{\pi }}=\ln (2) + \sqrt{\frac{2N}{\pi }}\Vert \tilde{I}_\mu \Vert _\varDelta \end{aligned} \end{aligned}$$
(49)

and \(S=\ln (2)\) when \(\tilde{I}_\mu =0\), as expected.

However, the real case of interest is the behavior of the penalty when N is assumed to be large, as this is consistent with the fact that we derived Eq. 44 as an asymptotic expansion of Eq. 23. In this case, we can use the asymptotic expansion for the Faddeeva function ([1], Section 7.1.23):

$$\begin{aligned} \exp [s^2]\mathrm {erfc}(s)\sim \frac{1}{s\sqrt{\pi }}\left[ 1+\sum _{k=1}^\infty (-1)^k\frac{1\cdot 3\cdots (2k-1)}{(2s^2)^k}\right] \end{aligned}$$

To leading order we obtain

$$\begin{aligned} \begin{aligned} S&\simeq \ln (2)+\ln (s\sqrt{\pi })\\&= \ln (2) + \ln (\sqrt{\frac{N\pi }{2}}\Vert \tilde{I}_\mu \Vert _\varDelta ) \end{aligned} \end{aligned}$$

which we can rewrite as

$$\begin{aligned} \boxed {S\simeq \frac{1}{2}\ln \frac{N}{2\pi } + \ln \left[ 2\pi \Vert \tilde{I}_\mu \Vert _\varDelta \right] } \end{aligned}$$
(50)

We can summarize the above by saying that a new penalty term of order \(\ln N\) arises due to the presence of the boundary. Interestingly, comparing Eq. 50 with Eq. 45 we see that the first term in Eq. 50 is analogous to counting an extra parameter dimension in the original Fisher Information Approximation.
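The rate at which the leading-order expression of Eq. 50 approaches the exact penalty of Eq. 46 is easy to check directly. In the sketch below (illustrative values only), \(\Vert \tilde{I}_\mu \Vert _\varDelta \) is held fixed while N grows:

```python
import numpy as np
from scipy.special import erfcx

grad_norm = 0.25   # ||I~_mu||_Delta, an arbitrary illustrative value

for N in [10, 100, 1000, 10000]:
    s = np.sqrt(N / 2) * grad_norm                                           # Eq. 43, m = 0
    S_exact = np.log(2) - np.log(erfcx(s))                                   # Eq. 46
    S_lead = 0.5 * np.log(N / (2 * np.pi)) + np.log(2 * np.pi * grad_norm)   # Eq. 50
    print(f"N = {N:6d}   S_exact = {S_exact:6.3f}   S_leading = {S_lead:6.3f}")
```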

Fig. 5. Comparison of Fisher Information Approximation and full Bayes computation of the log posterior ratio (LPR) for the model pairs used in our psychophysics tasks (\(N=10\)). Each row corresponds to one task type (from top to bottom, “horizontal”, “point”, “rounded”, “vertical”). First column from the left: full Bayesian LPR, computed by numerical integration. Second column: LPR computed with the Fisher Information Approximation. Third column: difference between FIA and exact LPR. Fourth column: relative difference (difference divided by the absolute value of the FIA LPR).

1.7 A.7 Numerical Comparison of the Extended FIA vs Exact Bayes

Figure 5 shows that the FIA computed with the expressions given above provides a very good approximation to the exact Bayesian log posterior ratio (LPR) for the model pairs used in the psychophysics experiments, and for the chosen sample size (\(N=10\)). As highlighted in the panels in the rightmost column, the discrepancies between the exact and the approximated LPR are generally small in relative terms, and therefore are not very important for the purpose of model fitting and interpretation. Note that here, as well as for the results in the main text, the S term in the FIA is computed using Eq. 46 rather than Eq. 50 in order to avoid infinities (that for finite N can arise when the likelihood gradient is very small) and discontinuities (that for finite N can arise on the interior of the manifold, in proximity to the boundary, where the value of S goes from zero when \(\hat{\vartheta }\) is in the interior to \(\ln (2)\) when \(\hat{\vartheta }\) is exactly on the boundary).

Even though the overall agreement between the approximation and the exact computation is good, it is interesting to look more closely at where it is the least so. The task type for which the discrepancies are the largest (both in absolute and relative terms) is the “rounded” type (third row in Fig. 5). This is because the FIA hypotheses are not fully satisfied everywhere for one of the models. More specifically, the models in that task variant are a circular arc (the bottom model in Fig. 5, third row) and a smaller circular arc, concentric with the first, with a straight segment attached to either side (the top model). The log-likelihood function for this second model is only smooth to first order: its second derivative (and therefore its Fisher Information and its observed Fisher Information) is not continuous at the points where the circular arc is joined with the straight segments, locally breaking hypothesis number 3 in Subsect. A.1. Geometrically, this is analogous to saying that the curvature of the manifold changes abruptly at the joints. It is likely that the FIA for a model with a smoother transition between the circular arc and the straight arms would have been even closer to the exact value for all points on the 2D plane (the data space). More generally, this line of reasoning suggests that it would be interesting to investigate the features of a model that affect the quality of the Fisher Information Approximation.
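To make this kind of exact-versus-FIA comparison concrete without reproducing the actual task stimuli, the sketch below performs the same computation for a deliberately simple toy problem: a one-dimensional model manifold given by a horizontal segment in the 2D data plane, isotropic Gaussian noise of known scale, and \(N=10\) observations. All model details and numerical values here are assumptions chosen only for illustration; they are not the model pairs of Fig. 5.

```python
import numpy as np
from scipy.special import erfc

rng = np.random.default_rng(0)
N, sigma, L_seg = 10, 0.5, 2.0    # sample size, noise scale, segment length (made up)

# Toy 1D model (not one of the paper's stimuli): manifold f(theta) = (theta, 0)
# for theta in [0, L_seg], isotropic Gaussian noise in the 2D data plane. Here
# g(theta) = 1/sigma^2, so the Jeffreys prior is uniform and the observed Fisher
# information coincides with the expected one.
X = rng.normal(loc=[-0.3, 0.2], scale=sigma, size=(N, 2))

def log_lik(theta):
    """ln P(X | theta), summed over observations, vectorized over theta."""
    theta = np.atleast_1d(theta)
    mu = np.stack([theta, np.zeros_like(theta)], axis=-1)        # (G, 2)
    sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)    # (N, G)
    return (-sq / (2 * sigma**2) - np.log(2 * np.pi * sigma**2)).sum(axis=0)

# Exact: -ln of the prior-weighted marginal likelihood (Eq. 16), by quadrature.
grid = np.linspace(0.0, L_seg, 2001)
logp = log_lik(grid) - np.log(L_seg)          # Jeffreys prior w(theta) = 1/L_seg
c = logp.max()
exact = -(c + np.log(np.exp(logp - c).sum() * (grid[1] - grid[0])))

# Extended FIA (Eq. 45): with d = 1 and g = I_obs = 1/sigma^2 the curvature term
# vanishes and ln int sqrt(det g) = ln(L_seg / sigma).
theta_hat = np.clip(X[:, 0].mean(), 0.0, L_seg)        # constrained ML point
I_grad = (theta_hat - X[:, 0].mean()) / sigma**2       # I~_theta at theta_hat
norm_grad = abs(I_grad) * sigma                        # ||I~||_Delta, with Delta = sigma^2
margin = min(theta_hat, L_seg - theta_hat)             # distance to the nearest boundary
m = np.sqrt(N / 2) * margin / sigma                    # Eq. 42 (single active constraint)
s = np.sqrt(N / 2) * norm_grad                         # Eq. 43
S = np.log(2) - np.log(np.exp(s**2) * erfc(s - m))     # Eq. 46
fia = (-log_lik(theta_hat)[0]
       + 0.5 * np.log(N / (2 * np.pi))
       + np.log(L_seg / sigma)
       + S)

print(f"exact -ln P = {exact:.3f}   extended FIA = {fia:.3f}")
```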

B Supplementary Information on the Analysis of the Psychophysics Data

1.1 B.1 Technical Details of the Inference Procedure

Table 1. \(\hat{R}\) statistic and effective sample size (ESS) for 8 Markov Chain traces run as described in the main text. See [8] (Sections 11.4–11.5) and [27] for in-depth discussion of chain quality diagnostics. Briefly, \(\hat{R}\) depends on the relationship between the variance of the draws estimated within and between contiguous draw sequences. \(\hat{R}\) is close to 1 when the chains have successfully converged. The effective sample size estimates how many independent samples one would need to extract the same amount of information as that contained in the (correlated) MCMC draws. Note that here, for computational convenience, we report diagnostics for 8 chains with 1000 draws each, while the results reported in the main text have been obtained with 10 times as many draws (8 chains \(\times \) 10000 draws per chain), run with identical settings.

Posterior sampling was performed with PyMC3 [26] version 3.9.3, using the NUTS Hamiltonian Monte Carlo algorithm [14], with target acceptance probability set to 0.9. The posterior distributions reported in the main text are built by sampling 8 independent Markov chains for 10000 draws each. No divergences occurred in any of the chains. Effective sample size and \(\hat{R}\) diagnostics for some of the key parameters are given in Table 1 for a shorter run of the same procedure.
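For reference, the sketch below reproduces the sampling settings and diagnostics described above (8 chains, NUTS with target acceptance probability 0.9, \(\hat{R}\) and ESS via ArviZ) on a small stand-in hierarchical logistic model with simulated data. It is not the model of the main text, and the variable names are placeholders; call signatures are those of PyMC3 3.9.x and ArviZ.

```python
import arviz as az
import numpy as np
import pymc3 as pm

# Stand-in data and model (NOT the paper's model): a minimal hierarchical
# logistic regression with subject-level intercepts and slopes, used only to
# make the sampling and diagnostic calls concrete.
rng = np.random.default_rng(0)
n_subjects, n_trials = 20, 50
subj = np.repeat(np.arange(n_subjects), n_trials)
x = rng.normal(size=n_subjects * n_trials)                    # per-trial evidence
y = rng.binomial(1, 1 / (1 + np.exp(-(0.2 + 1.5 * x))))       # simulated choices

with pm.Model() as model:
    mu_a = pm.Normal("mu_a", 0.0, 1.0)
    mu_b = pm.Normal("mu_b", 0.0, 1.0)
    alpha = pm.Normal("alpha", mu_a, 1.0, shape=n_subjects)
    beta = pm.Normal("beta", mu_b, 1.0, shape=n_subjects)
    p = pm.math.sigmoid(alpha[subj] + beta[subj] * x)
    pm.Bernoulli("choice", p=p, observed=y)

    # Settings described in the text: NUTS with target acceptance 0.9, 8 chains.
    # (The main-text results use 10000 draws per chain; 1000 here, as in Table 1.)
    trace = pm.sample(draws=1000, tune=1000, chains=8,
                      target_accept=0.9, return_inferencedata=True)

# R-hat and effective sample size, as reported in Table 1, plus divergence count.
print(az.summary(trace, var_names=["mu_a", "mu_b"], round_to=3))
print("divergences:", int(trace.sample_stats.diverging.sum()))
```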

1.2 B.2 Posterior Predictive Checks

Fig. 6. Simple posterior predictive check, looking at subject performance. A sample of all subject-level parameters (\(\alpha _i\) and \(\beta _i\)) is taken at random from the MCMC chains used for model inference. Using those parameter values, a simulation of the experiment is run using the actual stimuli shown to the subjects, and the resulting performance of all 202 simulated subjects is recorded. This procedure is repeated 2000 times, yielding 2000 samples of the joint posterior-predictive distribution of task performance over all experimental subjects. To visualize this distribution, for each subject we plotted a cloud of 2000 dots where the y coordinate of each dot is the simulated performance of that subject in one of the simulations, and the x coordinate is the true performance of that subject in the experiment plus a small random jitter (for ease of visualization). The gray line is the identity, showing that our inference procedure captures well the behavioral patterns in the experimental data. (Color figure online)

We performed a simple posterior predictive check [18] to ensure that the Bayesian hierarchical model described in the main text captures the main pattern of behavior across our subjects. In Fig. 6, the behavioral performance of the subjects is compared with its posterior predictive distribution under the model. As can be seen from the figure, the performance of each subject is correctly captured by the model, across systematic differences between task types (with subjects performing better in the “vertical” task than the “rounded” task, for instance) as well as individual differences between subjects that performed the same task variant.
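In outline, the posterior predictive check amounts to re-simulating the experiment once per retained posterior draw. The sketch below illustrates the bookkeeping with stand-in arrays and an assumed logistic decision rule; in the actual analysis the parameter draws come from the MCMC trace, the stimuli are the experimental ones, and the decision rule is the model described in the main text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws, n_subjects, n_trials = 200, 202, 100   # 2000 draws were used for Fig. 6

# Stand-ins for posterior draws of the subject-level parameters and for the
# per-trial evidence seen by each subject (all placeholders).
alpha_draws = rng.normal(0.0, 0.5, size=(n_draws, n_subjects))
beta_draws = rng.normal(2.0, 0.5, size=(n_draws, n_subjects))
evidence = rng.normal(size=(n_subjects, n_trials))
correct_choice = evidence > 0

def simulate_performance(alpha, beta, x, correct, rng):
    """Simulate one subject's choices for the trials they actually saw and
    return the fraction of correct responses (assumed logistic decision rule)."""
    p_choice = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))
    choices = rng.random(x.shape) < p_choice
    return np.mean(choices == correct)

perf = np.empty((n_draws, n_subjects))
for d in range(n_draws):
    for i in range(n_subjects):
        perf[d, i] = simulate_performance(alpha_draws[d, i], beta_draws[d, i],
                                          evidence[i], correct_choice[i], rng)

# perf[:, i] is the posterior predictive distribution of subject i's performance,
# to be plotted against that subject's observed performance as in Fig. 6.
print(perf.mean(axis=0)[:5].round(3))
```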

1.3 B.3 Formal Model Comparison

We compared the Bayesian hierarchical model described in the main text to a simpler model, where subjects were assumed to be sensitive only to likelihood differences, or in other words to choose \(\mathcal {M}_1\) over \(\mathcal {M}_2\) only based on which model was on average closer to the dot cloud constituting the stimulus on a given trial. Mathematically, this “likelihood only” model was equivalent to fixing all \(\beta \) parameters to zero except for \(\beta _L\) in the model described in the main text. All other details of the model were the same, and in particular the model still had a hierarchical structure with adaptive shrinkage (the subject-level parameters \(\alpha \) and \(\beta _L\) were modeled as coming from Student T distributions controlled by population-level parameters). We compared the full model and the likelihood-only model using the Widely Applicable Information Criterion [9]. This comparison, shown in Table 2, reveals strong evidence in favor of the full model; a sketch of the corresponding ArviZ call is given below the table.

Table 2. WAIC comparison of the full model and the likelihood-only model for the experimental data, reported in the standard format used by [21] (Section 6.4.2). Briefly, WAIC is the value of the criterion (log-score scale—higher is better); pWAIC is the estimated effective number of parameters; dWAIC is the difference between the WAIC of the given model and the highest-ranked one; SE is the standard error of the WAIC estimate; and dSE is the standard error of the difference in WAIC. These estimates were produced with the compare function provided by ArviZ [19], using 8 MCMC chains with 1000 samples each for each model (in total, 8000 samples for each model).
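A minimal sketch of the ArviZ comparison call follows. The two InferenceData objects used here are ArviZ's bundled example traces, standing in for the traces of the full and likelihood-only models (whose construction is not reproduced here).

```python
import arviz as az

# Stand-in InferenceData objects (ArviZ's bundled example traces), used only to
# make the call signature concrete; in the actual analysis these would be the
# traces of the full model and of the likelihood-only model.
idata_full = az.load_arviz_data("centered_eight")
idata_restricted = az.load_arviz_data("non_centered_eight")

comparison = az.compare(
    {"full": idata_full, "likelihood_only": idata_restricted},
    ic="waic",      # rank models by WAIC, as in Table 2
    scale="log",    # log-score scale: higher is better
)
print(comparison)   # WAIC estimates, effective parameters, differences, standard errors
```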


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Piasini, E., Balasubramanian, V., Gold, J.I. (2022). Effect of Geometric Complexity on Intuitive Model Selection. In: Nicosia, G., et al. (eds.) Machine Learning, Optimization, and Data Science. LOD 2021. Lecture Notes in Computer Science, vol. 13163. Springer, Cham. https://doi.org/10.1007/978-3-030-95467-3_1

  • DOI: https://doi.org/10.1007/978-3-030-95467-3_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-95466-6

  • Online ISBN: 978-3-030-95467-3

