Abstract
Occam’s razor is the principle stating that, all else being equal, simpler explanations for a set of observations are to be preferred to more complex ones. This idea can be made precise in the context of statistical inference, where the same quantitative notion of complexity of a statistical model emerges naturally from different approaches based on Bayesian model selection and information theory. The broad applicability of this mathematical formulation suggests a normative model of decision-making under uncertainty: complex explanations should be penalized according to this common measure of complexity. However, little is known about whether and how humans intuitively quantify the relative complexity of competing interpretations of noisy data. Here we measure the sensitivity of naive human subjects to statistical model complexity. Our data show that human subjects bias their decisions in favor of simple explanations based not only on the dimensionality of the alternatives (number of model parameters), but also on finer-grained aspects of their geometry. In particular, as predicted by the theory, models intuitively judged as more complex are not only those with more parameters, but also those with larger volume and prominent curvature or boundaries. Our results imply that principled notions of statistical model complexity have direct quantitative relevance to human decision-making.
Notes
1. In a similar way, one could define a two-dimensional model represented by a 2D area on the screen. This approach would provide an additional evaluation point for the dependence of the simplicity bias on model dimensionality. However, unlike a 0D or 1D model, a 2D model in a 2D data space will always suffer from boundary effects for data falling anywhere outside the model manifold. Therefore, because one primary goal of this study was to disentangle the distinct contributions of the models’ different geometrical features to the simplicity bias, we use only 1D models.
References
Abramowitz, M., Stegun, I.A.: Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables. Dover, New York (1972)
Amari, S.I., Nagaoka, H.: Methods of Information Geometry. Translations of Mathematical Monographs. American Mathematical Society (2000)
Balasubramanian, V.: Statistical inference, Occam’s razor, and statistical mechanics on the space of probability distributions. Neural Comput. 9(2), 349–368 (1997). https://doi.org/10.1162/neco.1997.9.2.349
Betancourt, M.: A conceptual introduction to Hamiltonian Monte Carlo (2018). https://arxiv.org/abs/1701.02434
Bialek, W., Nemenman, I., Tishby, N.: Predictability, complexity and learning. Neural Comput. 13, 2409–2463 (2001). https://doi.org/10.1162/089976601753195969
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006). https://doi.org/10.1007/978-1-4615-7566-5
Efron, B., Hinkley, D.V.: Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher information. Biometrika 65(3), 457–483 (1978). https://doi.org/10.1093/biomet/65.3.457
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis, 3rd edn. CRC Press, Boca Raton (2014)
Gelman, A., Hwang, J., Vehtari, A.: Understanding predictive information criteria for Bayesian models. Stat. Comput. 24(6), 997–1016 (2013). https://doi.org/10.1007/s11222-013-9416-2
Genewein, T., Braun, D.A.: Occam’s razor in sensorimotor learning. Proc. Roy. Soc. B Biol. Sci. 281(1783), 20132952 (2014). https://doi.org/10.1098/rspb.2013.2952
Grünwald, P.D.: The Minimum Description Length Principle. MIT press, Cambridge (2007)
Gull, S.F.: Bayesian inductive inference and maximum entropy. In: Erickson, G.J., Smith, C.R. (eds.) Maximum-Entropy and Bayesian Methods in Science and Engineering, pp. 53–74. Springer, Netherlands (1988). https://doi.org/10.1007/978-94-009-3049-0_4
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning, 2nd edn. Springer, Heidelberg (2009)
Hoffman, M.D., Gelman, A.: The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15(47), 1593–1623 (2014). http://jmlr.org/papers/v15/hoffman14a.html
Jaynes, E.T.: Probability Theory: The Logic of Science. Cambridge University Press, Cambridge (2003)
Jeffreys, H.: Theory of Probability. Clarendon Press, Oxford (1939)
Johnson, S., Jin, A., Keil, F.: Simplicity and goodness-of-fit in explanation: the case of intuitive curve-fitting. In: Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 36, no. 36 (2014)
Kruschke, J.K.: Doing Bayesian Data Analysis, 2nd edn. Academic Press, Cambridge (2015)
Kumar, R., Carroll, C., Hartikainen, A., Martin, O.: ArviZ: a unified library for exploratory analysis of Bayesian models in Python. J. Open Source Softw. 4(33), 1143 (2019). https://doi.org/10.21105/joss.01143
MacKay, D.J.C.: Bayesian interpolation. Neural Comput. 4(3), 415–447 (1992). https://doi.org/10.1162/neco.1992.4.3.415
McElreath, R.: Statistical Rethinking. CRC Press, Boca Raton (2016)
Piasini, E., Balasubramanian, V., Gold, J.I.: Preregistration document (2016). https://doi.org/10.17605/OSF.IO/2X9H6
Piasini, E., Balasubramanian, V., Gold, J.I.: Preregistration document addendum. https://doi.org/10.17605/OSF.IO/5HDQZ
Piasini, E., Gold, J.I., Balasubramanian, V.: Information geometry of Bayesian model selection (2021, unpublished)
Rissanen, J.: Stochastic complexity and modeling. Ann. Stat. 14(3), 1080–1100 (1986). https://www.jstor.org/stable/3035559
Salvatier, J., Wiecki, T.V., Fonnesbeck, C.: Probabilistic programming in Python using PyMC3. PeerJ Comput. Sci. 2, e55 (2016). https://doi.org/10.7717/peerj-cs.55
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., Bürkner, P.C.: Rank-normalization, folding, and localization: an improved \(\hat{R}\) for assessing convergence of MCMC. Bayesian Analysis (2020). https://doi.org/10.1214/20-ba1221
Acknowledgements
We thank Chris Pizzica for help with setting up the web-based version of the experiments, and for managing subject recruitment. We acknowledge support or partial support from R01 NS113241 (EP) and R01 EB026945 (VB and JG).
Appendices
A Derivation of the Boundary Term in the Fisher Information Approximation
Here we generalize the derivation of the Fisher Information Approximation given in [3] to the case where the maximum likelihood solution for a model lies on the boundary of the parameter space. Apart from the more general assumptions, the following derivation closely follows the original one, with some minor notational changes.
A.1 Set-up and Hypotheses
The problem we consider here is that of selecting between two models (say \(\mathcal {M}_1\) and \(\mathcal {M}_2\)) after observing empirical data \(X=\left\{ x_i\right\} _{i=1}^N\). N is the sample size and \(\mathcal {M}_1\) is assumed to have d parameters, collectively indexed as \(\vartheta \) and taking values in a compact domain \(\varTheta \). As a prior over \(\vartheta \) we take Jeffreys’ prior:
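\[ \mathbb {P}(\vartheta |\mathcal {M}_1) \;=\; \frac{\sqrt{\det g(\vartheta )}}{\int _{\varTheta } d^d\vartheta '\, \sqrt{\det g(\vartheta ')}} \]
(written here in its standard normalized form over the compact domain \(\varTheta \); denoting the prior by \(\mathbb {P}(\vartheta |\mathcal {M}_1)\) is our notational choice),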
where g is the (expected) Fisher Information of the model \(\mathcal {M}_1\):
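\[ g_{\mu \nu }(\vartheta ) \;=\; -\int dx\; \mathbb {P}(x|\vartheta )\, \frac{\partial ^2 \ln \mathbb {P}(x|\vartheta )}{\partial \vartheta ^\mu \,\partial \vartheta ^\nu } , \]
written here in its usual single-observation form.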
The Bayesian posterior
then becomes, after assuming a flat prior over models and dropping irrelevant terms,
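an expression proportional to a ratio of integrals over \(\varTheta \); schematically,
\[ \mathbb {P}(\mathcal {M}_1|X) \;\propto \; \frac{\int _{\varTheta } d^d\vartheta \; \sqrt{\det g(\vartheta )}\; \mathbb {P}(X|\vartheta )}{\int _{\varTheta } d^d\vartheta \; \sqrt{\det g(\vartheta )}} . \]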
Just as in [3], we now make a number of regularity assumptions: 1. \(\ln \mathbb {P}(X|\vartheta )\) is smooth; 2. there is a unique global minimum \(\hat{\vartheta }\) for \(-\ln \mathbb {P}(X|\vartheta )\); 3. \(g_{\mu \nu }(\vartheta )\) is smooth; 4. \(g_{\mu \nu }(\hat{\vartheta })\) is positive definite; 5. \(\varTheta \subset \mathbb {R}^d\) is compact; and 6. the values of the local minima of \(-\ln \mathbb {P}(X|\vartheta )\) are bounded away from the global minimum by some \(\epsilon >0\). Importantly, unlike in [3], we do not assume that \(\hat{\vartheta }\) is in the interior of \(\varTheta \).
The Shape of \(\varTheta \). Because we are specifically interested in understanding what happens at a boundary of the parameter space, we will add a further assumption that, while being not very restrictive in spirit, will allow us to derive a particularly interpretable result. In particular, we will assume that \(\varTheta \) is specified by a single linear constraint of the form
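(the particular symbols below are an illustrative choice, consistent with the discussion that follows)
\[ D_\mu \,\vartheta ^\mu + b \;\ge \; 0 , \]
with \(D_\mu \) a fixed covector and b a fixed offset.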
Without loss of generality, we’ll also take the constraint to be expressed in Hessian normal form—namely, \(\Vert D_\mu \Vert =1\).
For clarity, note this assumption on the shape of \(\varTheta \) is only used from Subsect. A.3 onward.
A.2 Preliminaries
We will now proceed to set up a low-temperature expansion of Eq. 16 around the saddle point \(\hat{\vartheta }\). We start by rewriting the numerator in Eq. 16 as
The idea of the Fisher Information Approximation is to expand the integrand in Eq. 18 in powers of N around the maximum likelihood point \(\hat{\vartheta }\). To this end, let’s define three useful objects:
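In analogy with [3], two of these are plausibly the (per-observation) first and second derivatives of the log likelihood evaluated at the maximum likelihood point, e.g.
\[ \tilde{I}_\mu \;\equiv \; -\frac{1}{N}\, \frac{\partial \ln \mathbb {P}(X|\vartheta )}{\partial \vartheta ^\mu }\bigg |_{\hat{\vartheta }} , \qquad \tilde{I}_{\mu \nu } \;\equiv \; -\frac{1}{N}\, \frac{\partial ^2 \ln \mathbb {P}(X|\vartheta )}{\partial \vartheta ^\mu \,\partial \vartheta ^\nu }\bigg |_{\hat{\vartheta }} , \]
with \(\tilde{I}_{\mu \nu }\) playing the role of the observed Fisher Information used below (the 1/N normalization is our assumption, chosen so that the exponent of the integrand scales explicitly with N).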

We immediately note that

which is useful in order to compute

It is also useful to center the integration variables by introducing
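(presumably, shifted variables of the form \(\phi ^\mu = \vartheta ^\mu - \hat{\vartheta }^\mu \))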
so that

and Eq. 18 becomes
Therefore,
where
and
where \(G(\phi )\) collects the terms that are suppressed by powers of N.
Our problem has now been reduced to computing Q by performing the integral in Eq. 23. This is where our assumptions come into play for the key approximation step. For the sake of simplicity, assuming that N is large, we drop \(G(\phi )\) from the expression above, so that Q becomes a simple Gaussian integral with a linear term:
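schematically (a form consistent with the definitions above),
\[ Q \;\approx \; \int _{\varPhi } d^d\phi \; \exp \left\{ -N\left[ \tilde{I}_\mu \,\phi ^\mu + \tfrac{1}{2}\,\tilde{I}_{\mu \nu }\,\phi ^\mu \phi ^\nu \right] \right\} , \]
where \(\varPhi \) denotes the correspondingly shifted integration domain.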
A.3 Choosing a Good System of Coordinates
Consider now the Observed Fisher Information at maximum likelihood, \(\tilde{I}_{\mu \nu }\). As long as it is not singular, we can define its inverse \(\varDelta ^{\mu \nu }=(\tilde{I}_{\mu \nu })^{-1}\). If \(\tilde{I}_{\mu \nu }\) is positive definite, then the matrix representation of \(\tilde{I}_{\mu \nu }\) will have a set of d positive eigenvalues which we will denote by \(\{\sigma _{(1)}^{-2}, \sigma _{(2)}^{-2},\ldots ,\sigma _{(d)}^{-2}\}\). The matrix representation of \(\varDelta ^{\mu \nu }\) will have eigenvalues \(\{\sigma _{(1)}^{2}, \sigma _{(2)}^{2},\ldots ,\sigma _{(d)}^{2}\}\), and will be diagonal in the same choice of coordinates as \(\tilde{I}_{\mu \nu }\). Denote by U the (orthogonal) diagonalizing matrix, i.e., U is such that
Define also the matrix K as the product of the diagonal matrix with elements \(1/\sigma _{(k)}\) along the diagonal and U:
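in index notation,
\[ K^\mu _{\;\;\nu } \;=\; \frac{1}{\sigma _{(\mu )}}\, U^\mu _{\;\;\nu } \qquad \text {(no sum over } \mu \text {)} . \]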
Note that
and that K corresponds to a sphering transformation, in the sense that
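(in matrix notation) \(K\varDelta K^\top = \mathbb {1}\), i.e., the quadratic form defined by \(\tilde{I}_{\mu \nu }\) becomes the identity in the new coordinates.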
and therefore, if we define the inverse
we have
We can now define a new set of coordinates by centering and sphering, as follows:
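(plausibly, by combining the shift by \(\hat{\vartheta }\), the map K, and a \(\sqrt{N}\) rescaling that makes the quadratic term of order one)
\[ \xi ^\mu \;=\; \sqrt{N}\; K^\mu _{\;\;\nu }\,\left( \vartheta ^\nu - \hat{\vartheta }^\nu \right) . \]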
Then,
and
In this new set of coordinates,
where we have used Eq. 29 as well as the fact that \(\varDelta ^{\mu \nu }=\varDelta ^{\nu \mu }\) and that \(\varDelta ^{\mu \kappa }\tilde{I}_{\kappa \nu }=\delta ^\mu _{\;\;\nu }\) by definition.
Therefore, putting Eq. 31 and Eq. 33 together, Eq. 25 becomes
The problem is reduced to a (truncated) spherical Gaussian integral, where the domain of integration \(\varXi \) will depend on the original domain \(\varTheta \) but also on \(\tilde{I}_\mu \), \(\tilde{I}_{\mu \nu }\) and \(\hat{\vartheta }\). To complete the calculation, we now need to make this dependence explicit.
A.4 Determining the Domain of Integration
We start by combining Eq. 19 and Eq. 32 to yield
By substituting Eq. 35 into Eq. 17 we get
which we can rewrite as
with
and
where by \(\langle \cdot ,\cdot \rangle _\varDelta \) we mean the inner product in the inverse observed Fisher information metric. Now, note that whenever \(\tilde{I}_\mu \) is not zero it will be parallel to \(D_\mu \). Indeed, by construction of the maximum likelihood point \(\hat{\vartheta }\), the gradient of the log likelihood can only be orthogonal to the boundary at \(\hat{\vartheta }\), and pointing towards the outside of the domain; therefore \(\tilde{I}_\mu \), which is defined as minus the gradient, will point inward. At the same time, \(D_\mu \) will also always point toward the interior of the domain because of the form of the constraint we have chosen in Eq. 17. Because by assumption \(\Vert D_\mu \Vert =1\), we have that
and
so that
Now, the signed distance of the boundary to the origin in \(\xi \)-space is
where the sign is taken such that l is negative when the origin is included in the integration domain. But noting that
we have
and therefore
Finally, by plugging Eq. 39 into Eq. 40 we obtain
where m and s are defined for convenience like so:
We note that m is a rescaled version of the margin defined by the constraint on the parameters (and therefore is never negative by assumption), and s is a rescaled version of the norm of the gradient of the log likelihood in the inverse observed Fisher metric (and therefore is nonnegative by construction).
A.5 Computing the Penalty
We can now perform a final change of variables in the integral in Eq. 34. We rotate our coordinates to align them to the boundary, so that
Note that we can always do this as our integrand is invariant under rotation. In this coordinate system, Eq. 34 factorizes:
where \(\mathrm {erfc}(\cdot )\) is the complementary error function [1, Section 7.1.2].
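For reference, the complementary error function is
\[ \mathrm {erfc}(x) \;=\; \frac{2}{\sqrt{\pi }} \int _x^\infty e^{-t^2}\, dt . \]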
Finally, plugging Eq. 44 into Eq. 22 and taking the log, we obtain the extended FIA:
where
can be interpreted as a penalty arising from the presence of the boundary in parameter space.
A.6 Interpreting the Penalty
We will now take a closer look at Eq. 46. To do this, one key observation we will use is that, by construction, at most one of m and s is ever nonzero. This is because in the interior of the manifold, \(m>0\) by definition, but \(s=0\) because the gradient of the likelihood is zero at \(\hat{\vartheta }\); and on the boundary, \(m=0\) by definition, and s can be either zero or positive.
Interior of the Manifold. When \(\hat{\vartheta }\) is in the interior of the parameter space \(\varTheta \), then \(\tilde{I}_\mu =0\Rightarrow s=0\) and Eq. 46 simplifies to
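a form like (in our schematic rendering, consistent with the limiting behavior noted next)
\[ S \;=\; -\ln \left[ \tfrac{1}{2}\,\mathrm {erfc}(-m) \right] , \]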
but since N is large we have \(m\gg 0\), \(\mathrm {erfc}(-m)\rightarrow 2\) and \(S\rightarrow 0\), so our result passes the first sanity check: we recover the expression in [3].
Boundary of the Manifold. When \(\hat{\vartheta }\) is on the boundary of \(\varTheta \), \(m=0\) and \(s\ge 0\). Equation 46 becomes
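schematically (again our rendering, consistent with the limits discussed below)
\[ S \;=\; -\ln \left[ \tfrac{1}{2}\, w(is) \right] , \]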
where w is the Faddeeva function [1, Section 7.1.3]:
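\[ w(z) \;=\; e^{-z^2}\,\mathrm {erfc}(-iz) , \]
which for purely imaginary argument reduces to \(w(is) = e^{s^2}\,\mathrm {erfc}(s)\).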
This function is tabulated and can be computed efficiently. However, it is interesting to analyze its limiting behavior.
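For instance, a minimal numerical sketch in Python (assuming the boundary penalty takes the schematic form \(S=-\ln [\tfrac{1}{2} w(is)]\) given above; the function name and example values are ours):

import numpy as np
from scipy.special import wofz  # Faddeeva function w(z)

def boundary_penalty(s):
    # Illustrative boundary penalty S = -ln[(1/2) w(i s)], s >= 0 (assumed schematic form).
    # For real s, wofz(1j * s) = exp(s**2) * erfc(s) is real and positive.
    return -np.log(0.5 * np.real(wofz(1j * s)))

print(boundary_penalty(0.0))   # ln(2) ~ 0.693: the s -> 0 limit discussed below
print(boundary_penalty(10.0))  # ~ ln(2 * sqrt(pi) * s) for large s, i.e. ~ 3.57 here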
As a consistency check, when s is small at fixed N we have, to first order:
and \(S=\ln (2)\) when \(\tilde{I}_\mu =0\), as expected.
However, the real case of interest is the behavior of the penalty when N is assumed to be large, as this is consistent with the fact that we derived Eq. 44 as an asymptotic expansion of Eq. 23. In this case, we use the asymptotic expansion of the Faddeeva function [1, Section 7.1.23]:
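\[ w(z) \;\simeq \; \frac{i}{\sqrt{\pi }\, z}\left( 1 + \frac{1}{2z^2} + \frac{3}{4z^4} + \cdots \right) , \qquad |z| \rightarrow \infty . \]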
To leading order we obtain
which we can rewrite as
We can summarize the above by saying that a new penalty term of order \(\ln N\) arose due to the presence of the boundary. Interestingly, comparing Eq. 50 with Eq. 45 we see that the first term in Eq. 50 is analogous to counting an extra parameter dimension in the original Fisher Information Approximation.
Fig. 5. Comparison of Fisher Information Approximation and full Bayes computation of the log posterior ratio (LPR) for the model pairs used in our psychophysics tasks (\(N=10\)). Each row corresponds to one task type (from top to bottom, “horizontal”, “point”, “rounded”, “vertical”). First column from the left: full Bayesian LPR, computed by numerical integration. Second column: LPR computed with the Fisher Information Approximation. Third column: difference between FIA and exact LPR. Fourth column: relative difference (difference divided by the absolute value of the FIA LPR).
A.7 Numerical Comparison of the Extended FIA vs Exact Bayes
Figure 5 shows that the FIA computed with the expressions given above provides a very good approximation to the exact Bayesian log posterior ratio (LPR) for the model pairs used in the psychophysics experiments, and for the chosen sample size (\(N=10\)). As highlighted in the panels in the rightmost column, the discrepancies between the exact and the approximated LPR are generally small in relative terms, and therefore are not very important for the purpose of model fitting and interpretation. Note that here, as well as for the results in the main text, the S term in the FIA is computed using Eq. 46 rather than Eq. 50, in order to avoid infinities (which for finite N can arise when the likelihood gradient is very small) and discontinuities (which for finite N can arise in the interior of the manifold, in proximity to the boundary, where the value of S jumps from zero when \(\hat{\vartheta }\) is in the interior to \(\ln (2)\) when \(\hat{\vartheta }\) is exactly on the boundary).
Even though the overall agreement between the approximation and the exact computation is good, it is interesting to look more closely at where it is the least so. The task type for which the discrepancies are the largest (both in absolute and relative terms) is the “rounded” type (third row in Fig. 5). This is because the FIA hypotheses are not fully satisfied everywhere for one of the models. More specifically, the models in that task variant are a circular arc (the bottom model in Fig. 5, third row) and a smaller circular arc, concentric with the first, with a straight segment attached to either side (the top model). The log-likelihood function for this second model is only smooth to first order: its second derivatives (and therefore its Fisher Information and its observed Fisher Information) are not continuous at the points where the circular arc is joined with the straight segments, locally breaking hypothesis number 3 in Subsect. A.1. Geometrically, this is analogous to saying that the curvature of the manifold changes abruptly at the joints. It is likely that the FIA for a model with a smoother transition between the circular arc and the straight arms would have been even closer to the exact value for all points on the 2D plane (the data space). More generally, this line of reasoning suggests that it would be interesting to investigate the features of a model that affect the quality of the Fisher Information Approximation.
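To illustrate the kind of computation involved on the exact-Bayes side of this comparison, here is a minimal Python sketch for a single data set (the segment geometry, noise level, and the uniform Jeffreys prior over a straight segment are illustrative assumptions, not the exact models used in the experiments):

import numpy as np
from scipy.integrate import quad

sigma = 1.0        # assumed isotropic noise standard deviation
half_length = 1.0  # assumed half-length of a horizontal segment centered at the origin

def log_lik(data, mu):
    # Gaussian log likelihood of the 2D points `data` (shape (N, 2)) for a model point `mu`.
    n = data.shape[0]
    return -np.sum((data - mu) ** 2) / (2 * sigma ** 2) - n * np.log(2 * np.pi * sigma ** 2)

def log_marginal_segment(data):
    # log P(X | segment): integrate the likelihood along the segment with a uniform prior
    # (for a straight segment the Fisher metric is constant, so Jeffreys' prior is uniform).
    integrand = lambda t: np.exp(log_lik(data, np.array([t, 0.0])))
    z, _ = quad(integrand, -half_length, half_length)
    return np.log(z / (2 * half_length))

def log_marginal_point(data):
    # log P(X | point at the origin): a 0D model, nothing to integrate over.
    return log_lik(data, np.array([0.0, 0.0]))

rng = np.random.default_rng(0)
data = rng.normal(loc=[0.3, 0.0], scale=sigma, size=(10, 2))  # N = 10, as in the experiments
exact_lpr = log_marginal_point(data) - log_marginal_segment(data)  # exact LPR, flat prior over models
print(exact_lpr)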
B Supplementary Information on the Analysis of the Psychophysics Data
B.1 Technical Details of the Inference Procedure
Posterior sampling was performed with PyMC3 [26] version 3.9.3, using the NUTS Hamiltonian Monte Carlo algorithm [14], with target acceptance probability set to 0.9. The posterior distributions reported in the main text are built by sampling 8 independent Markov chains for 10000 draws each. No divergence occurred in any of the chains. Effective sample size and \(\hat{R}\) diagnostics for some of the key parameters are given in Table 1 for a shorter run of the same procedure.
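For reference, a sampling run with this configuration looks roughly as follows (a minimal sketch on toy data: the model block is a simplified stand-in for the hierarchical model described in the main text, and all variable names are ours):

import numpy as np
import pymc3 as pm
import arviz as az

# Toy stand-in for the real data set: a few subjects making binary choices driven by a
# single trial-level predictor (e.g., a likelihood difference).
rng = np.random.default_rng(1)
n_subj, n_trials = 5, 100
x = rng.normal(size=(n_subj, n_trials))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x))))

with pm.Model() as model:
    # Population-level location and scale; subject-level parameters drawn from Student-T
    # distributions, mirroring the adaptive-shrinkage structure described in the text.
    mu_a = pm.Normal("mu_a", 0.0, 2.0)
    mu_b = pm.Normal("mu_b", 0.0, 2.0)
    sd_a = pm.HalfNormal("sd_a", 1.0)
    sd_b = pm.HalfNormal("sd_b", 1.0)
    a = pm.StudentT("a", nu=4, mu=mu_a, sigma=sd_a, shape=n_subj)
    b = pm.StudentT("b", nu=4, mu=mu_b, sigma=sd_b, shape=n_subj)
    p = pm.math.sigmoid(a[:, None] + b[:, None] * x)
    pm.Bernoulli("obs", p=p, observed=y)
    # 8 chains of 10000 draws each, target_accept=0.9, as described above.
    trace = pm.sample(draws=10000, chains=8, target_accept=0.9, return_inferencedata=True)

print(az.summary(trace, var_names=["mu_a", "mu_b"]))  # effective sample size and R-hat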
B.2 Posterior Predictive Checks
Fig. 6. Simple posterior predictive check, looking at subject performance. A sample of all subject-level parameters (\(\alpha _i\) and \(\beta _i\)) is drawn at random from the MCMC chains used for model inference. Using those parameter values, a simulation of the experiment is run using the actual stimuli shown to the subjects, and the resulting performance of all 202 simulated subjects is recorded. This procedure is repeated 2000 times, yielding 2000 samples of the joint posterior-predictive distribution of task performance over all experimental subjects. To visualize this distribution, for each subject we plotted a cloud of 2000 dots, where the y coordinate of each dot is the simulated performance of that subject in one of the simulations and the x coordinate is the true performance of that subject in the experiment plus a small random jitter (for ease of visualization). The gray line is the identity, showing that our inference procedure captures well the behavioral patterns in the experimental data. (Color figure online)
We performed a simple posterior predictive check [18] to ensure that the Bayesian hierarchical model described in the main text captures the main pattern of behavior across our subjects. In Fig. 6, the behavioral performance of the subjects is compared with its posterior predictive distribution under the model. As can be seen from the figure, the performance of each subject is correctly captured by the model, across systematic differences between task types (with subjects performing better in the “vertical” task than the “rounded” task, for instance) as well as individual differences between subjects that performed the same task variant.
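In terms of the toy sampling sketch in B.1 above, the check can be sketched as follows (here “performance” is simply the mean simulated choice rate, a stand-in for the fraction of correct choices used in the real analysis):

# Continuing from the B.1 sketch: draw 2000 posterior parameter samples and simulate the
# experiment with the same stimuli, recording each simulated subject's performance.
post = trace.posterior.stack(sample=("chain", "draw"))
idx = rng.integers(post.sizes["sample"], size=2000)
a_s = post["a"].values[:, idx]                 # shape (n_subj, 2000)
b_s = post["b"].values[:, idx]
p_sim = 1.0 / (1.0 + np.exp(-(a_s[:, :, None] + b_s[:, :, None] * x[:, None, :])))
y_sim = rng.binomial(1, p_sim)                 # shape (n_subj, 2000, n_trials)
perf_sim = y_sim.mean(axis=2)                  # posterior-predictive performance per subject
perf_obs = y.mean(axis=1)                      # observed counterpart, one value per subject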
B.3 Formal Model Comparison
We compared the Bayesian hierarchical model described in the main text to a simpler model, where subjects were assumed to be sensitive only to likelihood differences, or in other words to choose \(\mathcal {M}_1\) over \(\mathcal {M}_2\) based only on which model was on average closer to the dot cloud constituting the stimulus on a given trial. Mathematically, this “likelihood only” model was equivalent to fixing all \(\beta \) parameters to zero except for \(\beta _L\) in the model described in the main text. All other details of the model were the same, and in particular the model still had a hierarchical structure with adaptive shrinkage (the subject-level parameters \(\alpha \) and \(\beta _L\) were modeled as coming from Student T distributions controlled by population-level parameters). We compared the full model and the likelihood-only model using the Widely Applicable Information Criterion [9]. This comparison, shown in Table 2, reveals strong evidence in favor of the full model.
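With ArviZ [19], this kind of comparison can be sketched as follows (using the toy `trace` from the B.1 sketch as a stand-in; `trace_full` and `trace_lik_only` in the commented line are hypothetical InferenceData objects for the two real models):

import arviz as az

print(az.waic(trace))  # WAIC estimate for a single fitted model

# Comparing the two models, given their fitted traces, would look like:
# az.compare({"full": trace_full, "likelihood_only": trace_lik_only}, ic="waic")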