Abstract
If A and B are sets such that \(A \subset B\), generalisation may be understood as the inference from A of a hypothesis sufficient to construct B. One might infer any number of hypotheses from A, yet only some of those may generalise to B. How can one know which are likely to generalise? One strategy is to choose the shortest, equating the ability to compress information with the ability to generalise (a “proxy for intelligence”). We examine this in the context of a mathematical formalism of enactive cognition. We show that compression is neither necessary nor sufficient to maximise performance (measured in terms of the probability of a hypothesis generalising). We formulate a proxy unrelated to length or simplicity, called weakness. We show that if tasks are uniformly distributed, then there is no choice of proxy that performs at least as well as weakness maximisation in all tasks while performing strictly better in at least one. In experiments comparing maximum weakness and minimum description length in the context of binary arithmetic, the former generalised at between 1.1 and 5 times the rate of the latter. We argue this demonstrates that weakness is a far better proxy, and explains why DeepMind’s Apperception Engine is able to generalise effectively.
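To make the two proxies concrete, the following is a minimal, hypothetical sketch rather than the paper's experimental code: hypotheses are propositional formulas, the extension of a formula is the set of truth assignments it leaves open (so weakness counts the possibilities a hypothesis does not contradict), and description length is approximated by the number of symbols in the formula. The formulas, variables, and candidate set below are illustrative assumptions.

```python
from itertools import product

# Toy contrast of the two proxies; the formulas, variables and the
# candidate set are illustrative assumptions, not the paper's setup.
VARS = ("x", "y")
ASSIGNMENTS = [dict(zip(VARS, bits)) for bits in product([False, True], repeat=2)]

def weakness(formula):
    """Number of truth assignments the formula leaves open (its extension's size)."""
    return sum(bool(eval(formula, {}, dict(a))) for a in ASSIGNMENTS)

# Every candidate is consistent with the single observation x=True, y=True.
observation = {"x": True, "y": True}
candidates = ["x and y", "x", "x or y"]
assert all(eval(f, {}, dict(observation)) for f in candidates)

weakest = max(candidates, key=weakness)  # weakness maximisation
shortest = min(candidates, key=len)      # minimum description length (toy)

print(weakest, weakness(weakest))   # 'x or y' leaves 3 of 4 assignments open
print(shortest, len(shortest))      # 'x' needs the fewest symbols
```

Here the two proxies disagree: “x or y” is the weakest candidate but not the shortest. This divergence is the kind the binary arithmetic experiments measure at scale.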
Notes
1. This proof is conditional upon certain assumptions regarding the nature of cognition as enactive, and a formalism thereof.
2. Assuming tasks are uniformly distributed, and weakness is well defined.
3. An example of how one might translate propositional logic into this representation is given at the end of this paper. It is worth noting that this representation of logical formulae addresses the symbol grounding problem [12], and was specifically constructed to address subjective performance claims in the context of AIXI [13].
4. Each state is just reality from the perspective of a point along one or more dimensions. States of reality must be separated by something, or there would be only one state of reality. For example, two different states of reality may be reality from the perspective of two different points in time, or in space, and so on.
5. Statements are the logical formulae about which we will reason.
6. e.g. \(Z_s\) is the extension of s.
7. For example, we might represent chess as a supervised learning problem where \(s \in S_\alpha\) is the state of a chessboard, \(z \in Z_s\) is a sequence of moves by two players that begins in s, and \(d \in D_\alpha \cap Z_s\) is such a sequence of moves that terminates in victory for one player in particular (the one undertaking the task). A schematic sketch of this framing is given after these notes.
8. For example, we might use weakness multiplied by a constant to the same effect.
9. \(\frac{2^{|Z_{\textbf{h}}|}}{2^{|L_{\mathfrak{v}}|}}\) is maximised when \(\textbf{h} = \emptyset\), because the optimal hypothesis given no information is to assume nothing (you have no sequence to predict, so why make assertions that might contradict the environment?). A worked instance of this ratio is given after these notes.
10. Two statements a and b are mutually exclusive if \(a \notin Z_b\) and \(b \notin Z_a\), which we write as \(\mu(a,b)\). Given \(x \in L_{\mathfrak{v}}\), the set of all mutually exclusive statements is a set \(K_x \subset L_{\mathfrak{v}}\) such that \(x \in K_x\) and \(\forall a, b \in K_x : \mu(a,b)\). It follows that \(\forall x \in L_{\mathfrak{v}} : \sum_{b \in K_x} p(b) = 1\). This definition is checked in a short sketch after these notes.
11. We acknowledge that some may object to the term universal, because \(\mathfrak{v}\) is finite.
12. We do not know which possibilities will eventuate. A less specific statement contradicts fewer possibilities. Of all hypotheses sufficient to explain what we perceive, the least specific is therefore the most likely. A counting illustration is given after these notes.
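A schematic rendering of note 7's supervised-learning framing. The type and field names are hypothetical, chosen only to mirror \(S_\alpha\), \(Z_s\) and \(D_\alpha\); this is a sketch of the structure, not the paper's notation made executable.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Set, Tuple

State = str                      # e.g. a chessboard position
MoveSequence = Tuple[str, ...]   # a sequence of moves beginning in some state

@dataclass(frozen=True)
class SupervisedTask:
    inputs: FrozenSet[State]                           # S_alpha: situations presented
    completions: Callable[[State], Set[MoveSequence]]  # s -> Z_s: sequences beginning in s
    decision: Callable[[MoveSequence], bool]           # z in D_alpha: the player wins
```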
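A worked instance of note 9's ratio, with illustrative numbers: since the empty hypothesis is contained in every statement, its extension is the whole language and the ratio is 1, while any more specific hypothesis scores strictly less.

```latex
% Assume a language of |L_v| = 16 statements. Then Z_{emptyset} = L_v,
% so the empty hypothesis scores 1; a hypothesis whose extension retains
% only 8 statements scores 2^{-8}.
\[
  \frac{2^{|Z_{\emptyset}|}}{2^{|L_{\mathfrak{v}}|}} = \frac{2^{16}}{2^{16}} = 1,
  \qquad
  \frac{2^{|Z_{\textbf{h}}|}}{2^{|L_{\mathfrak{v}}|}} = \frac{2^{8}}{2^{16}} = 2^{-8}
  \ \text{ when } |Z_{\textbf{h}}| = 8.
\]
```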
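Note 10's definition of mutual exclusion can be checked directly in a set-of-facts reading of statements, under the assumption (consistent with note 9, where \(Z_\emptyset = L_{\mathfrak{v}}\)) that \(a \in Z_b\) exactly when \(b \subseteq a\):

```python
# Hypothetical set-of-facts reading: a statement is a set of atomic facts,
# and a is in Z_b exactly when b is a subset of a.
def mu(a: frozenset, b: frozenset) -> bool:
    """Mutual exclusion: a not in Z_b and b not in Z_a."""
    return not (b <= a) and not (a <= b)

assert mu(frozenset({"p"}), frozenset({"q"}))            # neither contains the other
assert not mu(frozenset({"p"}), frozenset({"p", "q"}))   # {p,q} lies in Z_{p}
```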
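Finally, note 12's argument rendered as a small counting exercise, in the same toy propositional setting as the sketch following the abstract (illustrative assumptions throughout): if the true assignment is drawn uniformly from those consistent with what has been observed, the hypothesis that permits more assignments is more likely to remain true.

```python
from itertools import product

ASSIGNMENTS = [dict(zip(("x", "y"), bits)) for bits in product([False, True], repeat=2)]

def p_generalise(formula):
    """P(formula still holds | we observed only that x is True),
    with the truth drawn uniformly from the consistent completions."""
    completions = [a for a in ASSIGNMENTS if a["x"]]
    return sum(bool(eval(formula, {}, dict(a))) for a in completions) / len(completions)

print(p_generalise("x or y"))   # 1.0: the weaker hypothesis always survives
print(p_generalise("x and y"))  # 0.5: more specific, so less likely to hold
```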
References
Bennett, M.T.: Technical Appendices. Version 1.2.1 (2023). https://doi.org/10.5281/zenodo.7641742. https://github.com/ViscousLemming/Technical-Appendices
Sober, E.: Ockham’s Razors: A User’s Manual. Cambridge University Press (2015)
Rissanen, J.: Modeling by shortest data description. Automatica 14(5), 465–471 (1978)
Chollet, F.: On the Measure of Intelligence. arXiv preprint arXiv:1911.01547 (2019)
Chaitin, G.: The limits of reason. Sci. Am. 294(3), 74–81 (2006)
Solomonoff, R.: A formal theory of inductive inference. Part I. Inf. Control 7(1), 1–22 (1964)
Solomonoff, R.: A formal theory of inductive inference. Part II. Inf. Control 7(2), 224–254 (1964)
Kolmogorov, A.: On tables of random numbers. Sankhyā: Indian J. Stat. A 25, 369–376 (1963)
Hutter, M.: Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer, Heidelberg (2005)
Bennett, M.T.: Symbol emergence and the solutions to any task. In: Goertzel, B., Iklé, M., Potapov, A. (eds.) AGI 2021. LNCS (LNAI), vol. 13154, pp. 30–40. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-93758-4_4
Ward, D., Silverman, D., Villalobos, M.: Introduction: the varieties of enactivism. Topoi 36(3), 365–375 (2017). https://doi.org/10.1007/s11245-017-9484-6
Harnad, S.: The symbol grounding problem. Physica D: Nonlinear Phenomena 42(1), 335–346 (1990)
Leike, J., Hutter, M.: Bad universal priors and notions of optimality. In: Proceedings of the 28th COLT, PMLR, pp. 1244–1259 (2015)
Gupta, A.: Definitions. In: Zalta, E.N. (ed.) The Stanford Encyclopedia of Philosophy. Winter 2021. Stanford University (2021)
Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS. Curran Associates Inc., USA (2019)
Kirk, D.: NVIDIA CUDA software and GPU parallel computing architecture. In: ISMM 2007, Canada, pp. 103–104. ACM (2007)
Meurer, A., et al.: SymPy: symbolic computing in Python. PeerJ Comput. Sci. 3, e103 (2017). https://doi.org/10.7717/peerj-cs.103
Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 4(2), 100–107 (1968)
Hernández-Orallo, J., Dowe, D.L.: Measuring universal intelligence: towards an anytime intelligence test. Artif. Intell. 174(18), 1508–1539 (2010)
Legg, S., Veness, J.: An approximation of the universal intelligence measure. In: Dowe, D.L. (ed.) Algorithmic Probability and Friends. Bayesian Prediction and Artificial Intelligence. LNCS, vol. 7070, pp. 236–249. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-44958-1_18
Evans, R.: Kant’s cognitive architecture. Ph.D. thesis, Imperial College London (2020)
Evans, R., Sergot, M., Stephenson, A.: Formalizing Kant’s rules. J. Philos. Logic 49, 613–680 (2020)
Evans, R., et al.: Making sense of raw input. Artif. Intell. 299, 103521 (2021)
Bennett, M.T.: Compression, the fermi paradox and artificial super-intelligence. In: Goertzel, B., Iklé, M., Potapov, A. (eds.) AGI 2021. LNCS (LNAI), vol. 13154, pp. 41–44. Springer, Cham (2022). https://doi.org/10.1007/978-3-030-93758-4_5
Delétang, G., et al.: Neural Networks and the Chomsky Hierarchy. arXiv preprint arXiv:2207.02098 (2022)
Power, A., et al.: Grokking: generalization beyond overfitting on small algorithmic datasets. In: ICLR (2022)
Acknowledgement
Appendices are available on GitHub [1]. This work was supported by JST (JPMJMS2033).
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG