Abstract
We use a model for discrete stochastic search in which one or more objects (“targets”) are to be found by a search over n locations (“boxes”), where n may be infinitely large. Each box has some probability of containing a target, giving a distribution H over boxes. We model the search for the targets as a stochastic procedure that draws boxes using some distribution S. We first derive a general expression for the expected number of misses \(\text{E}[Z]\) made by the search procedure in terms of H and S. We then obtain an expression for the optimal distribution \(S^{*}\) that minimises \(\text{E}[Z]\). This yields a relation between the entropy of H and the KL-divergence between H and \(S^{*}\). The result induces a 2-partition of the boxes, consisting of those boxes with H-probability greater than \(\frac{1}{n}\) and the rest. We use this result to devise a stochastic search procedure for the practical situation in which H is unknown. We present results from simulations that agree with the theoretical predictions, and demonstrate that the expected number of misses made by the optimal seeker decreases as the entropy of H decreases, with the maximum attained for uniform H. Finally, we demonstrate applications of this stochastic search procedure with only a coarse assumption about H. The theoretical results and the procedure are applicable to stochastic search over any aspect of machine learning that involves a discrete search-space: for example, choice over features, structures or discretised parameter-selection. In this work, the procedure is used to select features for Deep Relational Machines (DRMs), which are Deep Neural Networks (DNNs) defined in terms of domain-specific knowledge and built with features selected from a large, potentially infinite, attribute space. Empirical results obtained across over 70 real-world datasets show that using the stochastic search procedure results in significantly better performance than the state-of-the-art.
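As a rough illustration of the model just described, the following minimal sketch (ours, not the authors' experimental code) simulates the hide-and-seek procedure: a target is hidden in one of n boxes according to H, and a seeker repeatedly draws boxes from S until the target is found. It compares a uniform seeker with the seeker \(S^* \propto \sqrt{H}\) derived in Theorem 2; the variable names and Monte-Carlo setup are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_misses(h, s, trials=200_000):
    """Monte-Carlo estimate of E[Z]: the target hides in a box drawn from H,
    and the seeker repeatedly samples boxes from S (with replacement) until it
    opens the target's box; Z counts the failed attempts."""
    n = len(h)
    targets = rng.choice(n, size=trials, p=h)
    # For a fixed target box k, the number of misses is geometric with
    # success probability s_k, so it can be sampled directly.
    return np.mean(rng.geometric(s[targets]) - 1)

n = 10
h = rng.dirichlet(np.ones(n))          # an arbitrary hider distribution H
s_uniform = np.full(n, 1.0 / n)        # a uniform seeker
s_opt = np.sqrt(h) / np.sqrt(h).sum()  # the optimal seeker S* from Theorem 2

print("uniform seeker :", simulate_misses(h, s_uniform))
print("optimal seeker :", simulate_misses(h, s_opt))
print("closed form    :", (h / s_opt).sum() - 1)   # E[Z] = sum_i h_i/s_i - 1
```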
References
Abadi, M., Agarwal, A., et al.: TensorFlow: large-scale machine learning on heterogeneous systems (2015). https://www.tensorflow.org/
Ando, H.Y., Dehaspe, L., Luyten, W., Van Craenenbroeck, E., Vandecasteele, H., Van Meervelt, L.: Discovering H-bonding rules in crystals with inductive logic programming. Mol. Pharm. 3(6), 665–674 (2006). https://doi.org/10.1021/mp060034z
Blum, A.: Learning boolean functions in an infinite attribute space. Mach. Learn. 9(4), 373–386 (1992). https://doi.org/10.1007/BF00994112
Chollet, F., et al.: Keras (2015). https://keras.io
Dash, T., Srinivasan, A., Vig, L., Orhobor, O.I., King, R.D.: Large-scale assessment of deep relational machines. In: Riguzzi, F., Bellodi, E., Zese, R. (eds.) ILP 2018. LNCS (LNAI), vol. 11105, pp. 22–37. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-99960-9_2
Fog, A.: Sampling methods for wallenius’ and fisher’s noncentral hypergeometric distributions. Commun. Stat. Simul. Comput.® 37(2), 241–257 (2008). https://doi.org/10.1080/03610910701790236
Ho, Y.C., Zhao, Q.C., Jia, Q.S.: Ordinal Optimization: Soft Optimization for Hard Problems. Springer, Boston (2007). https://doi.org/10.1007/978-0-387-68692-9
Kelly, F.: On optimal search with unknown detection probabilities. J. Math. Anal. Appl. 88(2), 422–432 (1982)
King, R.D., Muggleton, S.H., Srinivasan, A., Sternberg, M.J.: Structure-activity relationships derived by machine learning: the use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. Proc. Nat. Acad. Sci. U.S.A. 93(1), 438–42 (1996). https://doi.org/10.1073/pnas.93.1.438
King, R.D., Muggleton, S.H., Srinivasan, A., Sternberg, M.: Structure-activity relationships derived by machine learning: the use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. Proc. Nat. Acad. Sci. 93(1), 438–442 (1996). https://doi.org/10.1073/pnas.93.1.438
Kinga, D., Adam, J.B.: A method for stochastic optimization. In: International Conference on Learning Representations (ICLR), vol. 5 (2015)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Lidbetter, T., Lin, K.: Searching for multiple objects in multiple locations. arXiv preprint arXiv:1710.05332 (2017)
Lodhi, H.: Deep relational machines. In: Lee, M., Hirose, A., Hou, Z.-G., Kil, R.M. (eds.) ICONIP 2013. LNCS, vol. 8227, pp. 212–219. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-42042-9_27
Muggleton, S., De Raedt, L.: Inductive logic programming: theory and methods. J. Logic Program. 19, 629–679 (1994). https://doi.org/10.1016/0743-1066(94)90035-3
Ruckle, W.H.: A discrete search game. In: Raghavan, T.E.S., Ferguson, T.S., Parthasarathy, T., Vrieze, O.J. (eds.) Theory and Decision Library, pp. 29–43. Springer, Netherlands (1991). https://doi.org/10.1007/978-94-011-3760-7_4
Srinivasan, A.: A study of two probabilistic methods for searching large spaces with ILP. Technical report PRG-TR-16-00, Oxford University Computing Laboratory, Oxford (2000)
Stone, L.D.: Theory of Optimal Search, vol. 118. Elsevier, Amsterdam (1976)
Subelman, E.J.: A hide-search game. J. Appl. Probab. 18(3), 628–640 (1981). https://doi.org/10.2307/3213317
Van Craenenbroeck, E., Vandecasteele, H., Dehaspe, L.: Dmax’s functional group and ring library. https://dtai.cs.kuleuven.be/software/dmax/ (2002)
Vig, L., Srinivasan, A., Bain, M., Verma, A.: An investigation into the role of domain-knowledge on the use of embeddings. In: Lachiche, N., Vrain, C. (eds.) ILP 2017. LNCS (LNAI), vol. 10759, pp. 169–183. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-78090-0_12
Šourek, G., Aschenbrenner, V., Železny, F., Kuželka, O.: Lifted relational neural networks. In: Proceedings of the 2015th International Conference on Cognitive Computation: Integrating Neural and Symbolic Approaches, vol. 1583, pp. 52–60. COCO 2015. CEUR-WS.org, Aachen, Germany, Germany (2015). http://dl.acm.org/citation.cfm?id=2996831.2996838
Acknowledgments
The second author (A.S.) is a Visiting Professorial Fellow, School of CSE, UNSW Sydney. This work is partially supported by DST-SERB grant EMR/2016/002766, Government of India.
Appendix: Proofs
Proof of Lemma 1
Proof
The ideal case is \(\mathrm{E}[Z]=0\). That is, on average, the search opens the correct box k on its first attempt. Now \(\mathrm{P}(Z=0 \mid \text{the ball is in box } k) = s_k = (1-s_k)^0 s_k\). Since the ball can be in any of the n boxes, \(\mathrm{P}(Z=0) = \sum_{k=1}^{n} h_k (1-s_k)^0 s_k\). More generally, for \(Z=j\), the search opens wrong boxes j times before opening box k, so \(\mathrm{P}(Z=j) = \sum_{k=1}^{n} h_k (1-s_k)^j s_k\). The expected number of misses can now be computed:
\[ \mathrm{E}[Z] = \sum_{j=0}^{\infty} j\,\mathrm{P}(Z=j) = \sum_{j=0}^{\infty} j \sum_{k=1}^{n} h_k (1-s_k)^j s_k. \]
Swapping the summations over j and k, and using \(\sum_{j\ge 0} j x^j = \frac{x}{(1-x)^2}\) with \(x = 1-s_k\), we get
\[ \mathrm{E}[Z] = \sum_{k=1}^{n} h_k s_k \sum_{j=0}^{\infty} j (1-s_k)^j = \sum_{k=1}^{n} h_k s_k \cdot \frac{1-s_k}{s_k^2}. \]
This simplifies to:
\[ \mathrm{E}[Z] = \sum_{k=1}^{n} \frac{h_k (1-s_k)}{s_k} = \sum_{k=1}^{n} \frac{h_k}{s_k} - 1, \]
since \(\sum_{k=1}^{n} h_k = 1\).
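As a quick numerical check of the expression just derived (our own sketch; the distributions below are arbitrary illustrative values), the truncated series \(\sum_j j\,\mathrm{P}(Z=j)\) can be compared against the closed form \(\sum_k h_k/s_k - 1\):

```python
import numpy as np

h = np.array([0.4, 0.3, 0.15, 0.1, 0.05])   # hider distribution H (illustrative)
s = np.array([0.25, 0.25, 0.2, 0.2, 0.1])   # an arbitrary seeker distribution S

# Truncated version of E[Z] = sum_j j * P(Z=j) = sum_j j * sum_k h_k (1-s_k)^j s_k
j = np.arange(10_000)
series = sum(h[k] * s[k] * np.sum(j * (1 - s[k]) ** j) for k in range(len(h)))

closed_form = (h / s).sum() - 1             # the expression obtained in Lemma 1

print(series, closed_form)                  # the two values agree closely
```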
Proof of Lemma 2
Proof
This extends Lemma 1 (Expected Cost of Misses by the Seeker) to the general case of multiple (K) stationary hiders. The number of ways the K hiders can choose to hide in n boxes is \({}^{n}P_{K}\); let \(\mathsf{P}(n,K)\) denote the set of all such permutations. For example, \(\mathsf{P}(3,2) = \{(1,2), (1,3), (2,1), (2,3), (3,1), (3,2)\}\).
The K hiders hide according to any one of these choices \(\sigma(i)\) with probability \(\left( h_{\sigma (i)}^{(1)} h_{\sigma (i)}^{(2)} \cdots h_{\sigma (i)}^{(K)}\right)\), where \(h_{\sigma (i)}^{(k)}\) denotes the hider probability of the box in the kth position of \(\sigma(i)\). Analogously, on a single attempt the seeker finds one of these hiders with probability \(\left( s_{\sigma (i)}^{(1)} + s_{\sigma (i)}^{(2)} + \cdots + s_{\sigma (i)}^{(K)}\right)\), and fails to find any of them with probability \(1 - \left( s_{\sigma (i)}^{(1)} + \cdots + s_{\sigma (i)}^{(K)}\right)\). If the seeker makes j such misses, the corresponding probability is \(\left\{ 1-\left( s_{\sigma (i)}^{(1)} + \cdots + s_{\sigma (i)}^{(K)}\right) \right\}^j\). The expected number of misses for this multiple-hider formulation is therefore
\[ \mathrm{E}[Z] = \sum_{\sigma(i) \in \mathsf{P}(n,K)} \left( \prod_{k=1}^{K} h_{\sigma(i)}^{(k)} \right) \sum_{j=0}^{\infty} j \left\{ 1 - \sum_{k=1}^{K} s_{\sigma(i)}^{(k)} \right\}^{j} \left( \sum_{k=1}^{K} s_{\sigma(i)}^{(k)} \right). \]
This further simplifies, using the same geometric-series identity as in Lemma 1, to
\[ \mathrm{E}[Z] = \sum_{\sigma(i) \in \mathsf{P}(n,K)} \left( \prod_{k=1}^{K} h_{\sigma(i)}^{(k)} \right) \frac{1 - \sum_{k=1}^{K} s_{\sigma(i)}^{(k)}}{\sum_{k=1}^{K} s_{\sigma(i)}^{(k)}}. \]
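The following sketch (ours; the small example values are arbitrary) numerically confirms that the truncated double sum above and its simplified form agree:

```python
import itertools
import numpy as np

# Small example: n = 4 boxes, K = 2 hiders (illustrative values, not from the paper).
h = np.array([0.4, 0.3, 0.2, 0.1])    # hider distribution H
s = np.array([0.25, 0.25, 0.3, 0.2])  # seeker distribution S
K = 2

j = np.arange(20_000)
truncated, closed = 0.0, 0.0
for perm in itertools.permutations(range(len(h)), K):
    p = np.prod(h[list(perm)])        # probability of this placement of the K hiders
    q = np.sum(s[list(perm)])         # probability the seeker finds some hider in one attempt
    truncated += p * np.sum(j * (1 - q) ** j * q)
    closed += p * (1 - q) / q

print(truncated, closed)              # the two agree up to truncation error
```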
Proof of Theorem 1
Proof
The problem can be posed as a constrained optimisation problem in which the objective function to be minimised is
\[ f(S) = \sum_{i=1}^{n} \frac{h_i}{s_i}, \qquad \text{subject to } \sum_{i=1}^{n} s_i = 1. \]
Our objective is to minimise the function f given any hider distribution H. Let us write \(\mathbf{\nabla} f = \left( \dfrac{\partial f}{\partial s_1}, \dfrac{\partial f}{\partial s_2}, \ldots , \dfrac{\partial f}{\partial s_n}\right)\). In this problem, \(\mathbf{\nabla} f = \left( -\frac{h_1}{s_1^2}, -\frac{h_2}{s_2^2}, \ldots , -\frac{h_n}{s_n^2} \right)\). Computing the second derivatives, we get the (diagonal) Hessian
\[ \mathbf{\nabla}^2 f = \mathrm{diag}\left( \frac{2h_1}{s_1^3}, \frac{2h_2}{s_2^3}, \ldots , \frac{2h_n}{s_n^3} \right), \]
with all off-diagonal entries zero. Since \(h_i \ge 0\) and \(s_i > 0\) for all i, every entry of \(\mathbf{\nabla}^2 f\) is non-negative; the Hessian is therefore positive semi-definite, and hence f is convex.
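A small finite-difference check (our own sketch, with arbitrary example values for H and S) of the diagonal Hessian entries \(2h_i/s_i^3\):

```python
import numpy as np

h = np.array([0.5, 0.3, 0.2])        # example hider distribution (illustrative)
s = np.array([0.4, 0.35, 0.25])      # an interior point with s_i > 0

f = lambda s: np.sum(h / s)          # the objective f(S) = sum_i h_i/s_i
eps = 1e-4

for i in range(len(s)):
    e = np.zeros_like(s)
    e[i] = eps
    # Central finite-difference estimate of the i-th diagonal Hessian entry.
    d2 = (f(s + e) - 2 * f(s) + f(s - e)) / eps**2
    print(d2, 2 * h[i] / s[i]**3)    # numerical vs analytic entry: they agree closely
```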
Proof of Theorem 2
Proof
We will write \(\text{E}[Z]|_{H,S}\) as a function of S, i.e. f(S). Our objective is to minimise \(f(S) = \sum_{i=1}^{n}{\frac{h_i}{s_i}}\) subject to the constraint \(\sum_{i=1}^{n}s_i = 1\). The corresponding unconstrained (Lagrangian) form of this minimisation problem can be written as
\[ g(S,\lambda) = \sum_{i=1}^{n} \frac{h_i}{s_i} - \lambda\left(\sum_{i=1}^{n} s_i - 1\right). \]
To obtain the optimal values of S and \(\lambda\), we set \(\frac{\partial g}{\partial s_i} = 0\) for \(i = 1,\ldots,n\), and \(\frac{\partial g}{\partial \lambda} = 0\). This gives \(-\frac{h_i}{s_i^2} - \lambda = 0\) and \(\sum_{i=1}^{n}s_i = 1\). From the first set of equations, \(\lambda = -\frac{h_i}{s_i^2} < 0\), and so \(s_i = \frac{\sqrt{h_i}}{\sqrt{-\lambda}}\) for all i. Substituting this into \(\sum_{i=1}^{n}s_i = 1\) gives \(\sqrt{-\lambda} = \sum_{j=1}^{n}\sqrt{h_j}\). We therefore obtain the desired optimal seeker distribution \(S^*\): \(s^*_i = \frac{\sqrt{h_i}}{\sum_{j=1}^{n}{\sqrt{h_j}}},~\forall i \in \{1,\ldots,n\}\).
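The optimal form \(s^*_i = \sqrt{h_i}/\sum_j\sqrt{h_j}\) can be checked numerically; the sketch below (ours) compares \(\text{E}[Z]\) at \(S^*\) against many randomly drawn seeker distributions, and also evaluates the resulting minimum \((\sum_i\sqrt{h_i})^2 - 1\), which follows by substituting \(S^*\) into \(\sum_i h_i/s_i - 1\):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
h = rng.dirichlet(np.ones(n))               # example hider distribution (illustrative)

def expected_misses(s):
    return (h / s).sum() - 1                # E[Z] = sum_i h_i/s_i - 1 (Lemma 1)

s_star = np.sqrt(h) / np.sqrt(h).sum()      # optimal seeker from Theorem 2

# E[Z] at S* should not exceed E[Z] at any other seeker distribution.
others = rng.dirichlet(np.ones(n), size=10_000)
assert expected_misses(s_star) <= min(expected_misses(s) for s in others)

print(expected_misses(s_star), np.sqrt(h).sum()**2 - 1)   # both equal (sum_i sqrt(h_i))^2 - 1
```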
Proof of Corollary 1
Proof
Here H is uniform, i.e. \(h_i = \frac{1}{n}\) for all i. For any S with \(s_i>0\) for all i, we have \(\text{E}[Z]|_{H,S} = \frac{1}{n}\sum_{i=1}^n{\frac{1}{s_i}} - 1 \ge \frac{n}{\sum_{i=1}^n{s_i}} - 1 = n - 1\), where the inequality is the AM-HM inequality and the denominator is 1 because S is a distribution; equality holds exactly when S is uniform. So \(S^*\) must be the uniform distribution, and in this case \(\text{E}[Z]|_{H,S^*} = \sum_{i=1}^{n}{\frac{1/n}{1/n}} - 1 = \sum_{i=1}^{n}1 - 1 = n - 1.\)
Proof of Corollary 2
Proof
The proof is as follows:
Hence, the result follows.
Proof of Theorem 3
Proof
The KL-divergence between the two distributions H and \(S^*\) is defined as:
\[ d_{KL}(H \Vert S^*) = \sum_{i=1}^{n} h_i \log \frac{h_i}{s^*_i} = \sum_{i=1}^{n} h_i \log \frac{h_i \sum_{j=1}^{n}\sqrt{h_j}}{\sqrt{h_i}}. \]
Simplifying, we get:
\[ d_{KL}(H \Vert S^*) = \frac{1}{2}\sum_{i=1}^{n} h_i \log h_i + \log \sum_{j=1}^{n}\sqrt{h_j} = \log \sum_{j=1}^{n}\sqrt{h_j} - \frac{1}{2}\,\mathrm{Ent}(H), \]
where \(\mathrm{Ent}(H) = -\sum_{i=1}^{n} h_i \log h_i\) is the entropy of H.
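A numerical check (ours; natural logarithms assumed throughout, and the distribution below is an arbitrary example) of the relation between the entropy of H and \(d_{KL}(H \Vert S^*)\):

```python
import numpy as np

rng = np.random.default_rng(3)
h = rng.dirichlet(np.ones(12))                 # example hider distribution H
s_star = np.sqrt(h) / np.sqrt(h).sum()         # optimal seeker S* from Theorem 2

kl = np.sum(h * np.log(h / s_star))            # KL-divergence between H and S*
entropy = -np.sum(h * np.log(h))               # entropy of H (natural log)

print(kl, np.log(np.sqrt(h).sum()) - 0.5 * entropy)   # the two values coincide
```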
Proof of Lemma 3
Proof
The probability that a randomly drawn box is not in the U partition is \((1-p)\). The probability that, in a sample of s boxes, none are from the U partition is \((1-p)^s\); therefore the probability that at least 1 box amongst the s is from the U partition is \(1 - (1-p)^s\). We want this probability to be at least \(\alpha\). That is:
\[ 1 - (1-p)^s \ge \alpha. \]
With some simple arithmetic (rearranging to \((1-p)^s \le 1-\alpha\) and taking logarithms, noting that \(\log(1-p) < 0\)), it follows that
\[ s \ge \frac{\log(1-\alpha)}{\log(1-p)}. \]
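A small helper (our own sketch, not code from the paper) implementing this bound:

```python
import math

def min_sample_size(p, alpha):
    """Smallest integer s with 1 - (1-p)**s >= alpha, i.e. the bound in Lemma 3."""
    return math.ceil(math.log(1 - alpha) / math.log(1 - p))

# E.g. if only 5% of boxes are in the U partition, drawing at least one of them
# with probability 0.95 requires a sample of about 59 boxes.
print(min_sample_size(0.05, 0.95))   # 59
```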