Abstract
Occam’s razor is the principle stating that, all else being equal, simpler explanations for a set of observations are to be preferred to more complex ones. This idea can be made precise in the context of statistical inference, where the same quantitative notion of complexity of a statistical model emerges naturally from different approaches based on Bayesian model selection and information theory. The broad applicability of this mathematical formulation suggests a normative model of decision-making under uncertainty: complex explanations should be penalized according to this common measure of complexity. However, little is known about whether and how humans intuitively quantify the relative complexity of competing interpretations of noisy data. Here we measure the sensitivity of naive human subjects to statistical model complexity. Our data show that human subjects bias their decisions in favor of simple explanations based not only on the dimensionality of the alternatives (number of model parameters), but also on finer-grained aspects of their geometry. In particular, as predicted by the theory, models intuitively judged as more complex are not only those with more parameters, but also those with larger volume and prominent curvature or boundaries. Our results imply that principled notions of statistical model complexity have direct quantitative relevance to human decision-making.
Notes
1. In a similar way, one could define a two-dimensional model represented by a 2D area on the screen. This approach would provide an additional evaluation point for the dependence of the simplicity bias on model dimensionality. However, unlike a 0D or 1D model, a 2D model in a 2D data space will always suffer from boundary effects for data falling anywhere outside the model manifold. Therefore, because one primary goal of this study was to disentangle the distinct contributions of the models’ different geometrical features to the simplicity bias, we use only 1D models.
References
Abramowitz, M., Stegun, I.A.: Handbook of Mathematical Functions: With Formulas, Graphs, and Mathematical Tables. Dover, New York (1972)
Amari, S.I., Nagaoka, H.: Methods of Information Geometry. Translations of Mathematical Monographs. American Mathematical Society (2000)
Balasubramanian, V.: Statistical inference, Occam’s razor, and statistical mechanics on the space of probability distributions. Neural Comput. 9(2), 349–368 (1997). https://doi.org/10.1162/neco.1997.9.2.349
Betancourt, M.: A conceptual introduction to Hamiltonian Monte Carlo (2018). https://arxiv.org/abs/1701.02434
Bialek, W., Nemenman, I., Tishby, N.: Predictability, complexity and learning. Neural Comput. 13, 2409–2463 (2001). https://doi.org/10.1162/089976601753195969
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2006). https://doi.org/10.1007/978-1-4615-7566-5
Efron, B., Hinkley, D.V.: Assessing the accuracy of the maximum likelihood estimator: observed versus expected Fisher information. Biometrika 65(3), 457–483 (1978). https://doi.org/10.1093/biomet/65.3.457
Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A., Rubin, D.B.: Bayesian Data Analysis, 3rd edn. CRC Press, Boca Raton (2014)
Gelman, A., Hwang, J., Vehtari, A.: Understanding predictive information criteria for Bayesian models. Stat. Comput. 24(6), 997–1016 (2013). https://doi.org/10.1007/s11222-013-9416-2
Genewein, T., Braun, D.A.: Occam’s razor in sensorimotor learning. Proc. Roy. Soc. B Biol. Sci. 281(1783), 20132952 (2014). https://doi.org/10.1098/rspb.2013.2952
Grünwald, P.D.: The Minimum Description Length Principle. MIT press, Cambridge (2007)
Gull, S.F.: Bayesian inductive inference and maximum entropy. In: Erickson, G.J., Smith, C.R. (eds.) Maximum-Entropy and Bayesian Methods in Science and Engineering, pp. 53–74. Springer, Netherlands (1988). https://doi.org/10.1007/978-94-009-3049-0_4
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning, 2nd edn. Springer, Heidelberg (2009)
Hoffman, M.D., Gelman, A.: The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 15(47), 1593–1623 (2014). http://jmlr.org/papers/v15/hoffman14a.html
Jaynes, E.T.: Probability Theory: The Logic of Science. Cambridge University Press, Cambridge (2003)
Jeffreys, H.: Theory of Probability. Clarendon Press, Oxford (1939)
Johnson, S., Jin, A., Keil, F.: Simplicity and goodness-of-fit in explanation: the case of intuitive curve-fitting. In: Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 36, no. 36 (2014)
Kruschke, J.K.: Doing Bayesian Data Analysis, 2nd edn. Academic Press, Cambridge (2015)
Kumar, R., Carroll, C., Hartikainen, A., Martin, O.: ArviZ: a unified library for exploratory analysis of Bayesian models in Python. J. Open Source Softw. 4(33), 1143 (2019). https://doi.org/10.21105/joss.01143
MacKay, D.J.C.: Bayesian interpolation. Neural Comput. 4(3), 415–447 (1992). https://doi.org/10.1162/neco.1992.4.3.415
McElreath, R.: Statistical Rethinking. CRC Press, Boca Raton (2016)
Piasini, E., Balasubramanian, V., Gold, J.I.: Preregistration document (2016). https://doi.org/10.17605/OSF.IO/2X9H6
Piasini, E., Balasubramanian, V., Gold, J.I.: Preregistration document addendum. https://doi.org/10.17605/OSF.IO/5HDQZ
Piasini, E., Gold, J.I., Balasubramanian, V.: Information geometry of Bayesian model selection (2021, unpublished)
Rissanen, J.: Stochastic complexity and modeling. Ann. Stat. 14(3), 1080–1100 (1986). https://www.jstor.org/stable/3035559
Salvatier, J., Wiecki, T.V., Fonnesbeck, C.: Probabilistic programming in Python using PyMC3. PeerJ Comput. Sci. 2, e55 (2016). https://doi.org/10.7717/peerj-cs.55
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., Bürkner, P.C.: Rank-normalization, folding, and localization: an improved \(\hat{R}\) for assessing convergence of MCMC. Bayesian Analysis (2020). https://doi.org/10.1214/20-ba1221
Acknowledgements
We thank Chris Pizzica for help with setting up the web-based version of the experiments, and for managing subject recruitment. We acknowledge support or partial support from R01 NS113241 (EP) and R01 EB026945 (VB and JG).
Appendices
A Derivation of the Boundary Term in the Fisher Information Approximation
Here we generalize the derivation of the Fisher Information Approximation given in [3] to the case where the maximum likelihood solution for a model lies on the boundary of the parameter space. Apart from the more general assumptions, the following derivation closely follows the original one, with some minor notational changes.
A.1 Set-up and Hypotheses
The problem we consider here is that of selecting between two models (say \(\mathcal {M}_1\) and \(\mathcal {M}_2\)) after observing empirical data \(X=\left\{ x_i\right\} _{i=1}^N\). N is the sample size and \(\mathcal {M}_1\) is assumed to have d parameters, collectively indexed as \(\vartheta \) and taking values in a compact domain \(\varTheta \). As a prior over \(\vartheta \) we take Jeffreys’ prior:
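\[ \mathbb {P}(\vartheta |\mathcal {M}_1) \;=\; \frac{\sqrt{\det g(\vartheta )}}{\int _{\varTheta } d^d\vartheta '\, \sqrt{\det g(\vartheta ')}} \]
(written here in its standard normalized form over the compact domain \(\varTheta \); denoting the prior by \(\mathbb {P}(\vartheta |\mathcal {M}_1)\) is our notational choice),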
where g is the (expected) Fisher Information of the model \(\mathcal {M}_1\):
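\[ g_{\mu \nu }(\vartheta ) \;=\; -\int dx\; \mathbb {P}(x|\vartheta )\, \frac{\partial ^2 \ln \mathbb {P}(x|\vartheta )}{\partial \vartheta ^\mu \,\partial \vartheta ^\nu } , \]
written here in its usual single-observation form.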
The Bayesian posterior
then becomes, after assuming a flat prior over models and dropping irrelevant terms,
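an expression proportional to a ratio of integrals over \(\varTheta \); schematically,
\[ \mathbb {P}(\mathcal {M}_1|X) \;\propto \; \frac{\int _{\varTheta } d^d\vartheta \; \sqrt{\det g(\vartheta )}\; \mathbb {P}(X|\vartheta )}{\int _{\varTheta } d^d\vartheta \; \sqrt{\det g(\vartheta )}} . \]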
Just as in [3], we now make a number of regularity assumptions: 1. \(\ln \mathbb {P}(X|\vartheta )\) is smooth; 2. there is a unique global minimum \(\hat{\vartheta }\) for \(-\ln \mathbb {P}(X|\vartheta )\); 3. \(g_{\mu \nu }(\vartheta )\) is smooth; 4. \(g_{\mu \nu }(\hat{\vartheta })\) is positive definite; 5. \(\varTheta \subset \mathbb {R}^d\) is compact; and 6. the values of the local minima of \(-\ln \mathbb {P}(X|\vartheta )\) are bounded away from the global minimum by some \(\epsilon >0\). Importantly, unlike in [3], we do not assume that \(\hat{\vartheta }\) is in the interior of \(\varTheta \).
The Shape of \(\varTheta \). Because we are specifically interested in understanding what happens at a boundary of the parameter space, we will add a further assumption that, while being not very restrictive in spirit, will allow us to derive a particularly interpretable result. In particular, we will assume that \(\varTheta \) is specified by a single linear constraint of the form
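(the particular symbols below are an illustrative choice, consistent with the discussion that follows)
\[ D_\mu \,\vartheta ^\mu + b \;\ge \; 0 , \]
with \(D_\mu \) a fixed covector and b a fixed offset.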
Without loss of generality, we’ll also take the constraint to be expressed in Hessian normal form—namely, \(\Vert D_\mu \Vert =1\).
For clarity, note this assumption on the shape of \(\varTheta \) is only used from Subsect. A.3 onward.
A.2 Preliminaries
We will now proceed to set up a low-temperature expansion of Eq. 16 around the saddle point \(\hat{\vartheta }\). We start by rewriting the numerator in Eq. 16 as
The idea of the Fisher Information Approximation is to expand the integrand in Eq. 18 in powers of N around the maximum likelihood point \(\hat{\vartheta }\). To this end, let’s define three useful objects:
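In analogy with [3], two of these are plausibly the (per-observation) first and second derivatives of the log likelihood evaluated at the maximum likelihood point, e.g.
\[ \tilde{I}_\mu \;\equiv \; -\frac{1}{N}\, \frac{\partial \ln \mathbb {P}(X|\vartheta )}{\partial \vartheta ^\mu }\bigg |_{\hat{\vartheta }} , \qquad \tilde{I}_{\mu \nu } \;\equiv \; -\frac{1}{N}\, \frac{\partial ^2 \ln \mathbb {P}(X|\vartheta )}{\partial \vartheta ^\mu \,\partial \vartheta ^\nu }\bigg |_{\hat{\vartheta }} , \]
with \(\tilde{I}_{\mu \nu }\) playing the role of the observed Fisher Information used below (the 1/N normalization is our assumption, chosen so that the exponent of the integrand scales explicitly with N).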

We immediately note that

which is useful in order to compute

It is also useful to center the integration variables by introducing
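(presumably, shifted variables of the form \(\phi ^\mu = \vartheta ^\mu - \hat{\vartheta }^\mu \))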
so that

and Eq. 18 becomes
Therefore,
where
and
where \(G(\phi )\) collects the terms that are suppressed by powers of N.
Our problem has now been reduced to computing Q by performing the integral in Eq. 23. This is where our assumptions come into play for the key approximation step. For the sake of simplicity, assuming that N is large, we drop \(G(\phi )\) from the expression above, so that Q becomes a simple Gaussian integral with a linear term:
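schematically (a form consistent with the definitions above),
\[ Q \;\approx \; \int _{\varPhi } d^d\phi \; \exp \left\{ -N\left[ \tilde{I}_\mu \,\phi ^\mu + \tfrac{1}{2}\,\tilde{I}_{\mu \nu }\,\phi ^\mu \phi ^\nu \right] \right\} , \]
where \(\varPhi \) denotes the correspondingly shifted integration domain.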
A.3 Choosing a Good System of Coordinates
Consider now the Observed Fisher Information at maximum likelihood, \(\tilde{I}_{\mu \nu }\). As long as it is not singular, we can define its inverse \(\varDelta ^{\mu \nu }=(\tilde{I}_{\mu \nu })^{-1}\). If \(\tilde{I}_{\mu \nu }\) is positive definite, then the matrix representation of \(\tilde{I}_{\mu \nu }\) will have a set of d positive eigenvalues which we will denote by \(\{\sigma _{(1)}^{-2}, \sigma _{(2)}^{-2},\ldots ,\sigma _{(d)}^{-2}\}\). The matrix representation of \(\varDelta ^{\mu \nu }\) will have eigenvalues \(\{\sigma _{(1)}^{2}, \sigma _{(2)}^{2},\ldots ,\sigma _{(d)}^{2}\}\), and will be diagonal in the same choice of coordinates as \(\tilde{I}_{\mu \nu }\). Denote by U the (orthogonal) diagonalizing matrix, i.e., U is such that
Define also the matrix K as the product of the diagonal matrix with elements \(1/\sigma _{(k)}\) along the diagonal and U:
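in index notation,
\[ K^\mu _{\;\;\nu } \;=\; \frac{1}{\sigma _{(\mu )}}\, U^\mu _{\;\;\nu } \qquad \text {(no sum over } \mu \text {)} . \]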
Note that
and that K corresponds to a sphering transformation, in the sense that
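(in matrix notation) \(K\varDelta K^\top = \mathbb {1}\), i.e., the quadratic form defined by \(\tilde{I}_{\mu \nu }\) becomes the identity in the new coordinates.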
and therefore, if we define the inverse
we have
We can now define a new set of coordinates by centering and sphering, as follows:
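(plausibly, by combining the shift by \(\hat{\vartheta }\), the map K, and a \(\sqrt{N}\) rescaling that makes the quadratic term of order one)
\[ \xi ^\mu \;=\; \sqrt{N}\; K^\mu _{\;\;\nu }\,\left( \vartheta ^\nu - \hat{\vartheta }^\nu \right) . \]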
Then,
and
In this new set of coordinates,
where we have used Eq. 29 as well as the fact that \(\varDelta ^{\mu \nu }=\varDelta ^{\nu \mu }\) and that \(\varDelta ^{\mu \kappa }\tilde{I}_{\kappa \nu }=\delta ^\mu _{\;\;\nu }\) by definition.
Therefore, putting Eq. 31 and Eq. 33 together, Eq. 25 becomes
The problem is reduced to a (truncated) spherical Gaussian integral, where the domain of integration \(\varXi \) will depend on the original domain \(\varTheta \) but also on \(\tilde{I}_\mu \), \(\tilde{I}_{\mu \nu }\) and \(\hat{\vartheta }\). To complete the calculation, we now need to make this dependence explicit.
A.4 Determining the Domain of Integration
We start by combining Eq. 19 and Eq. 32 to yield
By substituting Eq. 35 into Eq. 17 we get
which we can rewrite as
with
and
where by \(\langle \cdot ,\cdot \rangle _\varDelta \) we mean the inner product in the inverse observed Fisher information metric. Now, note that whenever \(\tilde{I}_\mu \) is not zero it will be parallel to \(D_\mu \). Indeed, by construction of the maximum likelihood point \(\hat{\vartheta }\), the gradient of the log likelihood can only be orthogonal to the boundary at \(\hat{\vartheta }\), and pointing towards the outside of the domain; therefore \(\tilde{I}_\mu \), which is defined as minus the gradient, will point inward. At the same time, \(D_\mu \) will also always point toward the interior of the domain because of the form of the constraint we have chosen in Eq. 17. Because by assumption \(\Vert D_\mu \Vert =1\), we have that
and
so that
Now, the signed distance of the boundary to the origin in \(\xi \)-space is
where the sign is taken such that l is negative when the origin is included in the integration domain. But noting that
we have
and therefore
Finally, by plugging Eq. 39 into Eq. 40 we obtain
where m and s are defined for convenience like so:
We note that m is a rescaled version of the margin defined by the constraint on the parameters (and therefore is never negative by assumption), and s is a rescaled version of the norm of the gradient of the log likelihood in the inverse observed Fisher metric (and therefore is nonnegative by construction).
A.5 Computing the Penalty
We can now perform a final change of variables in the integral in Eq. 34. We rotate our coordinates to align them to the boundary, so that
Note that we can always do this as our integrand is invariant under rotation. In this coordinate system, Eq. 34 factorizes:
where \(\mathrm {erfc}(\cdot )\) is the complementary error function [1, Section 7.1.2].
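For reference, the complementary error function is
\[ \mathrm {erfc}(x) \;=\; \frac{2}{\sqrt{\pi }} \int _x^\infty e^{-t^2}\, dt . \]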
Finally, plugging Eq. 44 into Eq. 22 and taking the log, we obtain the extended FIA:
where
can be interpreted as a penalty arising from the presence of the boundary in parameter space.
A.6 Interpreting the Penalty
We will now take a closer look at Eq. 46. To do this, one key observation we will use is that, by construction, at most one of m and s is ever nonzero. This is because in the interior of the manifold, \(m>0\) by definition, but \(s=0\) because the gradient of the likelihood is zero at \(\hat{\vartheta }\); and on the boundary, \(m=0\) by definition, and s can be either zero or positive.
Interior of the Manifold. When \(\hat{\vartheta }\) is in the interior of the parameter space \(\varTheta \), then \(\tilde{I}_\mu =0\Rightarrow s=0\) and Eq. 46 simplifies to
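a form like (in our schematic rendering, consistent with the limiting behavior noted next)
\[ S \;=\; -\ln \left[ \tfrac{1}{2}\,\mathrm {erfc}(-m) \right] , \]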
but since N is large we have \(m\gg 0\), \(\mathrm {erfc}(-m)\rightarrow 2\) and \(S\rightarrow 0\), so our result passes the first sanity check: we recover the expression in [3].
Boundary of the Manifold. When \(\hat{\vartheta }\) is on the boundary of \(\varTheta \), \(m=0\) and \(s\ge 0\). Equation 46 becomes
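schematically (again our rendering, consistent with the limits discussed below)
\[ S \;=\; -\ln \left[ \tfrac{1}{2}\, w(is) \right] , \]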
where w is the Faddeeva function [1, Section 7.1.3]:
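\[ w(z) \;=\; e^{-z^2}\,\mathrm {erfc}(-iz) , \]
which for purely imaginary argument reduces to \(w(is) = e^{s^2}\,\mathrm {erfc}(s)\).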
This function is tabulated and can be computed efficiently. However, it is interesting to analyze its limiting behavior.
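For instance, a minimal numerical sketch in Python (assuming the boundary penalty takes the schematic form \(S=-\ln [\tfrac{1}{2} w(is)]\) given above; the function name and example values are ours):

import numpy as np
from scipy.special import wofz  # Faddeeva function w(z)

def boundary_penalty(s):
    # Illustrative boundary penalty S = -ln[(1/2) w(i s)], s >= 0 (assumed schematic form).
    # For real s, wofz(1j * s) = exp(s**2) * erfc(s) is real and positive.
    return -np.log(0.5 * np.real(wofz(1j * s)))

print(boundary_penalty(0.0))   # ln(2) ~ 0.693: the s -> 0 limit discussed below
print(boundary_penalty(10.0))  # ~ ln(2 * sqrt(pi) * s) for large s, i.e. ~ 3.57 here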
As a consistency check, when s is small at fixed N we have, to first order:
and \(S=\ln (2)\) when \(\tilde{I}_\mu =0\), as expected.
However, the real case of interest is the behavior of the penalty when N is assumed to be large, as this is consistent with the fact that we derived Eq. 44 as an asymptotic expansion of Eq. 23. In this case, we use the asymptotic expansion of the Faddeeva function [1, Section 7.1.23]:
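\[ w(z) \;\simeq \; \frac{i}{\sqrt{\pi }\, z}\left( 1 + \frac{1}{2z^2} + \frac{3}{4z^4} + \cdots \right) , \qquad |z| \rightarrow \infty . \]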
To leading order we obtain
which we can rewrite as
We can summarize the above by saying that a new penalty term of order \(\ln N\) arose due to the presence of the boundary. Interestingly, comparing Eq. 50 with Eq. 45 we see that the first term in Eq. 50 is analogous to counting an extra parameter dimension in the original Fisher Information Approximation.
Fig. 5. Comparison of Fisher Information Approximation and full Bayes computation of the log posterior ratio (LPR) for the model pairs used in our psychophysics tasks (\(N=10\)). Each row corresponds to one task type (from top to bottom, “horizontal”, “point”, “rounded”, “vertical”). First column from the left: full Bayesian LPR, computed by numerical integration. Second column: LPR computed with the Fisher Information Approximation. Third column: difference between FIA and exact LPR. Fourth column: relative difference (difference divided by the absolute value of the FIA LPR).
A.7 Numerical Comparison of the Extended FIA vs Exact Bayes
Figure 5 shows that the FIA computed with the expressions given above provides a very good approximation to the exact Bayesian log posterior ratio (LPR) for the model pairs used in the psychophysics experiments, and for the chosen sample size (\(N=10\)). As highlighted in the panels in the rightmost column, the discrepancies between the exact and the approximated LPR are generally small in relative terms, and therefore are not very important for the purpose of model fitting and interpretation. Note that here, as well as for the results in the main text, the S term in the FIA is computed using Eq. 46 rather than Eq. 50, in order to avoid infinities (which for finite N can arise when the likelihood gradient is very small) and discontinuities (which for finite N can arise in the interior of the manifold, in proximity to the boundary, where the value of S jumps from zero when \(\hat{\vartheta }\) is in the interior to \(\ln (2)\) when \(\hat{\vartheta }\) is exactly on the boundary).
Even though the overall agreement between the approximation and the exact computation is good, it is interesting to look more closely at where it is the least so. The task type for which the discrepancies are the largest (both in absolute and relative terms) is the “rounded” type (third row in Fig. 5). This is because the FIA hypotheses are not fully satisfied everywhere for one of the models. More specifically, the models in that task variant are a circular arc (the bottom model in Fig. 5, third row) and a smaller circular arc, concentric with the first, with a straight segment attached to either side (the top model). The log-likelihood function for this second model is only smooth to first order: its second derivatives (and therefore its Fisher Information and its observed Fisher Information) are not continuous at the points where the circular arc is joined with the straight segments, locally breaking hypothesis number 3 in Subsect. A.1. Geometrically, this is analogous to saying that the curvature of the manifold changes abruptly at the joints. It is likely that the FIA for a model with a smoother transition between the circular arc and the straight arms would have been even closer to the exact value for all points on the 2D plane (the data space). More generally, this line of reasoning suggests that it would be interesting to investigate the features of a model that affect the quality of the Fisher Information Approximation.
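To illustrate the kind of computation involved on the exact-Bayes side of this comparison, here is a minimal Python sketch for a single data set (the segment geometry, noise level, and the uniform Jeffreys prior over a straight segment are illustrative assumptions, not the exact models used in the experiments):

import numpy as np
from scipy.integrate import quad

sigma = 1.0        # assumed isotropic noise standard deviation
half_length = 1.0  # assumed half-length of a horizontal segment centered at the origin

def log_lik(data, mu):
    # Gaussian log likelihood of the 2D points `data` (shape (N, 2)) for a model point `mu`.
    n = data.shape[0]
    return -np.sum((data - mu) ** 2) / (2 * sigma ** 2) - n * np.log(2 * np.pi * sigma ** 2)

def log_marginal_segment(data):
    # log P(X | segment): integrate the likelihood along the segment with a uniform prior
    # (for a straight segment the Fisher metric is constant, so Jeffreys' prior is uniform).
    integrand = lambda t: np.exp(log_lik(data, np.array([t, 0.0])))
    z, _ = quad(integrand, -half_length, half_length)
    return np.log(z / (2 * half_length))

def log_marginal_point(data):
    # log P(X | point at the origin): a 0D model, nothing to integrate over.
    return log_lik(data, np.array([0.0, 0.0]))

rng = np.random.default_rng(0)
data = rng.normal(loc=[0.3, 0.0], scale=sigma, size=(10, 2))  # N = 10, as in the experiments
exact_lpr = log_marginal_point(data) - log_marginal_segment(data)  # exact LPR, flat prior over models
print(exact_lpr)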
B Supplementary Information on the Analysis of the Psychophysics Data
B.1 Technical Details of the Inference Procedure
Posterior sampling was performed with PyMC3 [26] version 3.9.3, using the NUTS Hamiltonian Monte Carlo algorithm [14], with target acceptance probability set to 0.9. The posterior distributions reported in the main text are built by sampling 8 independent Markov chains for 10000 draws each. No divergence occurred in any of the chains. Effective sample size and \(\hat{R}\) diagnostics for some of the key parameters are given in Table 1 for a shorter run of the same procedure.
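For reference, a sampling run with this configuration looks roughly as follows (a minimal sketch on toy data: the model block is a simplified stand-in for the hierarchical model described in the main text, and all variable names are ours):

import numpy as np
import pymc3 as pm
import arviz as az

# Toy stand-in for the real data set: a few subjects making binary choices driven by a
# single trial-level predictor (e.g., a likelihood difference).
rng = np.random.default_rng(1)
n_subj, n_trials = 5, 100
x = rng.normal(size=(n_subj, n_trials))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x))))

with pm.Model() as model:
    # Population-level location and scale; subject-level parameters drawn from Student-T
    # distributions, mirroring the adaptive-shrinkage structure described in the text.
    mu_a = pm.Normal("mu_a", 0.0, 2.0)
    mu_b = pm.Normal("mu_b", 0.0, 2.0)
    sd_a = pm.HalfNormal("sd_a", 1.0)
    sd_b = pm.HalfNormal("sd_b", 1.0)
    a = pm.StudentT("a", nu=4, mu=mu_a, sigma=sd_a, shape=n_subj)
    b = pm.StudentT("b", nu=4, mu=mu_b, sigma=sd_b, shape=n_subj)
    p = pm.math.sigmoid(a[:, None] + b[:, None] * x)
    pm.Bernoulli("obs", p=p, observed=y)
    # 8 chains of 10000 draws each, target_accept=0.9, as described above.
    trace = pm.sample(draws=10000, chains=8, target_accept=0.9, return_inferencedata=True)

print(az.summary(trace, var_names=["mu_a", "mu_b"]))  # effective sample size and R-hat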
B.2 Posterior Predictive Checks
Fig. 6. Simple posterior predictive check, looking at subject performance. A sample of all subject-level parameters (\(\alpha _i\) and \(\beta _i\)) is drawn at random from the MCMC chains used for model inference. Using those parameter values, a simulation of the experiment is run using the actual stimuli shown to the subjects, and the resulting performance of all 202 simulated subjects is recorded. This procedure is repeated 2000 times, yielding 2000 samples of the joint posterior-predictive distribution of task performance over all experimental subjects. To visualize this distribution, for each subject we plotted a cloud of 2000 dots, where the y coordinate of each dot is the simulated performance of that subject in one of the simulations and the x coordinate is the true performance of that subject in the experiment plus a small random jitter (for ease of visualization). The gray line is the identity, showing that our inference procedure captures well the behavioral patterns in the experimental data. (Color figure online)
We performed a simple posterior predictive check [18] to ensure that the Bayesian hierarchical model described in the main text captures the main pattern of behavior across our subjects. In Fig. 6, the behavioral performance of the subjects is compared with its posterior predictive distribution under the model. As can be seen from the figure, the performance of each subject is correctly captured by the model, across systematic differences between task types (with subjects performing better in the “vertical” task than the “rounded” task, for instance) as well as individual differences between subjects that performed the same task variant.
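In terms of the toy sampling sketch in B.1 above, the check can be sketched as follows (here “performance” is simply the mean simulated choice rate, a stand-in for the fraction of correct choices used in the real analysis):

# Continuing from the B.1 sketch: draw 2000 posterior parameter samples and simulate the
# experiment with the same stimuli, recording each simulated subject's performance.
post = trace.posterior.stack(sample=("chain", "draw"))
idx = rng.integers(post.sizes["sample"], size=2000)
a_s = post["a"].values[:, idx]                 # shape (n_subj, 2000)
b_s = post["b"].values[:, idx]
p_sim = 1.0 / (1.0 + np.exp(-(a_s[:, :, None] + b_s[:, :, None] * x[:, None, :])))
y_sim = rng.binomial(1, p_sim)                 # shape (n_subj, 2000, n_trials)
perf_sim = y_sim.mean(axis=2)                  # posterior-predictive performance per subject
perf_obs = y.mean(axis=1)                      # observed counterpart, one value per subject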
B.3 Formal Model Comparison
We compared the Bayesian hierarchical model described in the main text to a simpler model, where subjects were assumed to be sensitive only to likelihood differences, or in other words to choose \(\mathcal {M}_1\) over \(\mathcal {M}_2\) based only on which model was on average closer to the dot cloud constituting the stimulus on a given trial. Mathematically, this “likelihood only” model was equivalent to fixing all \(\beta \) parameters to zero except for \(\beta _L\) in the model described in the main text. All other details of the model were the same, and in particular the model still had a hierarchical structure with adaptive shrinkage (the subject-level parameters \(\alpha \) and \(\beta _L\) were modeled as coming from Student T distributions controlled by population-level parameters). We compared the full model and the likelihood-only model using the Widely Applicable Information Criterion [9]. This comparison, shown in Table 2, reveals strong evidence in favor of the full model.
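With ArviZ [19], this kind of comparison can be sketched as follows (using the toy `trace` from the B.1 sketch as a stand-in; `trace_full` and `trace_lik_only` in the commented line are hypothetical InferenceData objects for the two real models):

import arviz as az

print(az.waic(trace))  # WAIC estimate for a single fitted model

# Comparing the two models, given their fitted traces, would look like:
# az.compare({"full": trace_full, "likelihood_only": trace_lik_only}, ic="waic")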