1 Introduction

In this paper, we consider a reinforcement learning setting with the underlying Markov decision process (MDP) defined by the 4-tuple \((\mathbb {S}, \mathbb {A}, \mathrm {R}, \mathrm {P})\), where the finite sets \(\mathbb {S}\) and \(\mathbb {A}\) are referred to as the state space and the action space respectively. Further, \(\mathrm {R}:\mathbb {S}\times \mathbb {A}\times \mathbb {S}\rightarrow \mathbb {R}\) is the reward function, which defines the state transition costs, and \(\mathrm {P}: \mathbb {S}\times \mathbb {A}\times \mathbb {S}\rightarrow [0,1]\) is the transition probability function. A stationary randomized policy (SRP) \(\pi \) is a probability mass function over the actions conditioned on the state, i.e., for \(s \in \mathbb {S}\), we have \(\pi (\cdot \vert s) \in [0,1]^{\vert \mathbb {A}\vert }\) and \(\sum _{a \in \mathbb {A}}\pi (a \vert s) = 1\). A policy determines the action to be taken at each discrete time step of an arbitrary realization of the MDP. In this paper, we employ a parametrized class of SRPs \(\{\pi _{w} \vert w \in \mathbb {W} \subset \mathbb {R}^{k_2}\}\), where \(\mathbb {W}\) is assumed to be compact.

Under a policy \(\pi _w\), the behaviour of the MDP reduces to that of a Markov chain with transition probabilities \(\mathrm {P}_{w}(s, s^{\prime }) = \sum _{a \in \mathbb {A}}\pi _{w}(a \vert s)\mathrm {P}(s, a, s^{\prime })\). The performance of a policy is usually quantified by its value function, defined as the long-run \(\gamma \)-discounted transition costs (\(\gamma \in [0,1)\)) incurred by the MDP while following the policy: \(V^{w}(s) = \mathbb {E}_{w}\left[ \sum _{t \in \mathbb {N}}\gamma ^{t}\mathrm {R}(\mathbf {s}_t, \mathbf {a}_t, \mathbf {s}_{t+1}) \vert \mathbf {s}_0 = s\right] , s \in \mathbb {S}\), where \(\mathbb {E}_{w}[\cdot ]\) denotes the expectation w.r.t. the probability distribution of the Markov chain induced by the policy \(\pi _w\). The primary goal in an RL setting is to find the optimal policy, i.e., to solve \(\mathop {\arg \mathrm{max}}\nolimits _{w \in \mathbb {W}}V^{w}\), without any knowledge of the model parameters \(\mathrm {P}\) and \(\mathrm {R}\). However, observations in the form of sample paths, which are realizations of the MDP under some arbitrary policy, are available.
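For later reference, note that \(V^{w}\) is the unique solution of the associated Bellman equation (a standard fact for discounted MDPs, stated here for convenience):
$$\begin{aligned} V^{w}(s) = \sum _{a \in \mathbb {A}}\pi _{w}(a \vert s)\sum _{s^{\prime } \in \mathbb {S}}\mathrm {P}(s, a, s^{\prime })\big [\mathrm {R}(s, a, s^{\prime }) + \gamma V^{w}(s^{\prime })\big ], \quad s \in \mathbb {S}, \end{aligned}$$
or, in vector form, \(V^{w} = \mathrm {R}^{w} + \gamma \mathrm {P}_{w}V^{w}\), where \(\mathrm {R}^{w}(s) \triangleq \sum _{s^{\prime } \in \mathbb {S}, a \in \mathbb {A}}\pi _{w}(a \vert s)\mathrm {P}(s, a, s^{\prime })\mathrm {R}(s, a, s^{\prime })\) is the expected single-stage reward under \(\pi _w\) (the same \(\mathrm {R}^{w}\) appears later in Lemma 1).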

Classical approaches have complexities that scale polynomially in the cardinality of the state space and hence become intractable for large state spaces. This is commonly referred to as the curse of dimensionality. Hence, one has to resort to approximation techniques in order to achieve tractability. In this paper, for value function estimation, we consider linear function approximation. Here, for a given policy \(\pi _w\), we approximate its value function \(V^{w}\) by projecting it on to the subspace \(\{\varPhi x \vert x \in \mathbb {R}^{k_1}\}\), where \(\varPhi = (\phi _1, \phi _2, \dots , \phi _{k_1})^{\top }\), \(k_1 \ll \vert \mathbb {S}\vert \), and the prediction features \(\phi _i \in \mathbb {R}^{\vert \mathbb {S}\vert }\), \(1 \le i \le k_1\), are chosen a priori. In order for the projection to be well-defined, we require the following assumption:

(A1): For each \(w \in \mathbb {W}\) , the Markov chain induced by the policy \(\pi _w\) is ergodic.

Under this assumption, one can define the weighted norm \(\Vert \cdot \Vert _{\nu }\) as follows: For \(V \in \mathbb {R}^{\vert \mathbb {S}\vert }\), \(\Vert V \Vert _{\nu } = (\sum _{s \in \mathbb {S}}V^{2}(s)\nu (s))^{\frac{1}{2}}\), where \(\nu \) is the limiting distribution (steady state distribution) of the given sample path. If the sample path is generated using a policy \(\pi _{w_b}\), \(w_b \in \mathbb {W}\) (referred to as the behaviour policy), then the limiting distribution is the stationary distribution \(\nu _{w_b}\) of the Markov chain induced by the behaviour policy \(\pi _{w_b}\). Therefore, the linear function approximation of the value function \(V^{w}\) is defined as follows:

$$\begin{aligned} h_{w \vert w_b} \triangleq \mathop {\arg \mathrm{min}}\limits _{x \in \mathbb {R}^{k_1}}\Vert \varPhi x - V^{w}\Vert _{\nu _{w_b}} \end{aligned}$$
(1)
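As a side remark, under (A1) the projection in (1) admits the standard weighted least-squares closed form (assuming, for this remark only, that \(\varPhi \) has full column rank):
$$\begin{aligned} h_{w \vert w_b} = \big (\varPhi ^{\top }D^{\nu _{w_b}}\varPhi \big )^{-1}\varPhi ^{\top }D^{\nu _{w_b}}V^{w}, \end{aligned}$$
where \(D^{\nu _{w_b}}\) is the diagonal matrix with \(D^{\nu _{w_b}}_{ii} = \nu _{w_b}(i)\), \(1 \le i \le \vert \mathbb {S}\vert \) (the same matrix appears in Lemma 1). Of course, \(V^{w}\) itself is unknown, which is why the LSTD-based estimation procedure of Sect. 2.2 is required.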
In this paper, we solve the following problem:
$$\begin{aligned} w^{*} = \mathop {\arg \mathrm{max}}\limits _{w \in \mathbb {W}}\mathbb {E}_{\nu _w}\left[ h_{w \vert w}\right] , \end{aligned}$$
(2)

where we assume that an infinitely long sample path \(\{\mathbf {s}_0, \mathbf {a}_0, \mathbf {r}_0, \mathbf {s}_1\), \(\mathbf {a}_1, \mathbf {r}_1, \mathbf {s}_2, \dots \}\) generated by the behaviour policy \(\pi _{w_b}\) (\(w_b \in \mathbb {W} \subseteq \mathbb {R}^{k_{2}}\) ) is available.

(A2): The behaviour policy \(\pi _{w_b}\) , where \(w_b \in \mathbb {W}\) , satisfies the following condition: \(\pi _{w_b}(a \vert s) > 0\), \(\forall s \in \mathbb {S}, \forall a \in \mathbb {A}\).

2 Proposed Algorithm

Our proposed approach has two components:

  1. A stochastic approximation (SA) version of the cross entropy (CE) method to solve the control problem (2). The SA version of the CE method is a zero-order, incremental, adaptive and stable global optimization method.

  2. A variation of the off-policy LSTD(\(\lambda \)) to compute the objective function values (i.e., \(\mathbb {E}_{\nu _w}\left[ h_{w \vert w}\right] \)).

We describe these two components in detail below.

2.1 Stochastic Approximation Version of the Cross Entropy Method

The cross entropy (CE) method [4, 9] solves global optimization problems in which the objective function does not possess good structural properties, i.e., problems of the kind: find \(x^{*} = \, \mathop {\arg \mathrm{max}}\nolimits _{x \in \mathbb {X} \subset \mathbb {R}^{d}} J(x)\), where \(J:\mathbb {X} \rightarrow \mathbb {R}\) is a bounded Borel measurable function (\(J_l< J(x) < J_u\), \(\forall x\)). CE is a zero-order method, i.e., it does not require the gradient or higher-order derivatives of the objective function in order to seek the optimal solution. The CE method has found successful application in diverse domains, including continuous multi-extremal optimization [7], reinforcement learning [5, 6] and several NP-hard problems [7, 8].

The CE method generates a sequence of model parameters \(\{\theta _t\}_{t \in \mathbb {N}}\), \(\theta _t \in \varTheta \) (assumed to be compact), and a sequence of thresholds \(\{\gamma _t \in \mathbb {R}\}_{t \in \mathbb {N}}\) with \(J_l \le \gamma _t \le J_u\); the algorithm attempts to direct the sequence \(\{\theta _t\}\) towards the degenerate distribution concentrated at the global optimum \(x^{*}\) and the sequence \(\{\gamma _t\}\) towards \(J(x^{*})\). The threshold \(\gamma _{t+1}\) is usually taken as the \((1-\rho )\)-quantile of J w.r.t. the PDF \(f_{\theta _t}\). (For \(\theta \in \varTheta \), we denote by \(\gamma _{\rho }(J, \theta )\) the \((1-\rho )\)-quantile of J w.r.t. the PDF \(f_{\theta }\).) Hence, \(\gamma _{t+1} = \gamma _{\rho }(J, \theta _t)\). The model parameter \(\theta _{t+1}\) is then generated by projecting (w.r.t. the Kullback-Leibler (KL) divergence) on to the family of distributions \(\mathcal {F} \triangleq \{f_{\theta } \vert \theta \in \varTheta \}\) the zero-variance distribution concentrated in the region \(\{J(x) \ge \gamma _{t+1}\}\) with respect to the PDF \(f_{\theta _t}\). Thus,

$$\begin{aligned} \begin{aligned} \theta _{t+1} = \mathop {\arg \mathrm{min}}\limits _{\theta \in \varTheta } KL(f_{\theta }, g_t), \text { where } g_t(x) = \frac{S(J(x))f_{\theta _t}(x)\mathbb {I}_{\{J(x) \ge \gamma _{t+1}\}}}{\mathbb {E}_{\theta _t}\left[ S(J(\mathsf {X})) \mathbb {I}_{\{J(\mathsf {X}) \ge \gamma _{t+1}\}}\right] } \end{aligned} \end{aligned}$$
(3)

where \(S:\mathbb {R}\rightarrow \mathbb {R}_{+}\) is a positive, monotonically increasing function.

In this paper, we take the family of distributions \(\mathcal {F}\) to be the Gaussian family. In this case, the PDF is parametrized by \(\theta =(\mu , \varSigma )^{\top }\), where \(\mu \) and \(\varSigma \) are the mean vector and the covariance matrix of the Gaussian distribution respectively. One can solve the optimization problem (3) analytically to obtain the following update rule:

$$\begin{aligned} \left. \begin{aligned}&\mu _{t+1} = \frac{\mathbb {E}_{\theta _t}\left[ \mathsf {g}_{1}(J(\mathsf {X}), \mathsf {X},\gamma _{t+1})\right] }{\mathbb {E}_{\theta _t}\left[ \mathsf {g}_{0}(J(\mathsf {X}), \gamma _{t+1})\right] } \triangleq \varUpsilon _1(\theta _t, \gamma _{t+1}),\\&\varSigma _{t+1} = \frac{\mathbb {E}_{\theta _t}\left[ \mathsf {g}_{2}(J(\mathsf {X}), \mathsf {X},\gamma _{t+1}, \mu _{t+1})\right] }{\mathbb {E}_{\theta _t}\left[ \mathsf {g}_{0}(J(\mathsf {X}), \gamma _{t+1})\right] } \triangleq \varUpsilon _2(\theta _t, \gamma _{t+1}). \end{aligned} \right\} \end{aligned}$$
(4)
$$\begin{aligned} \left. \begin{aligned} \text { where }&\mathsf {g}_{0}(J(x), \gamma ) \triangleq S(J(x))\mathbb {I}_{\{J(x) \ge \gamma \}},\\&\mathsf {g}_{1}(J(x), x,\gamma ) \triangleq S(J(x))\mathbb {I}_{\{J(x) \ge \gamma \}}x,\\&\mathsf {g}_{2}(J(x), x,\gamma , \mu ) \triangleq S(J(x))\mathbb {I}_{\{J(x) \ge \gamma \}}\left( x-\mu \right) \left( x-\mu \right) ^{\top }. \end{aligned} \right\} \end{aligned}$$
(5)

Thus the CE algorithm can be expressed as \(\theta _{t+1} = (\varUpsilon _1(\theta _t, \gamma _{t+1}), \varUpsilon _2(\theta _t, \gamma _{t+1}))^{\top }\). However, this ideal version is intractable, since the quantities \(\mathbb {E}_{\theta _t}[\cdot ]\) and \(\gamma _{\rho }(\cdot , \cdot )\) cannot be computed exactly. There are multiple ways to track the ideal CE method. In this paper, we consider the efficient tracking of the ideal CE method using the stochastic approximation (SA) framework proposed in [1, 2]. The SA version of the CE method consists of three stochastic recursions, defined as follows:

$$\begin{aligned} \begin{aligned} \gamma _{t+1} =&\gamma _{t} - \beta _{t} \varDelta \gamma _{t}(J(\mathsf {X}_{t+1})),\\&\text { where }\varDelta \gamma _{t}(y) \triangleq -(1-\rho )\mathbb {I}_{\{y \ge \gamma _t\}}+\rho \mathbb {I}_{\{y \le \gamma _t\}}. \end{aligned} \end{aligned}$$
(6)
$$\begin{aligned} \begin{aligned} \xi ^{(0)}_{t+1} =&\xi ^{(0)}_{t}+\beta _{t}\varDelta \xi ^{(0)}_{t}(\mathsf {X}_{t+1}, J(\mathsf {X}_{t+1})),\\&\text { where }\varDelta \xi ^{(0)}_{t}(x, y) \triangleq \mathsf {g_{1}}(y, x, \gamma _t) - \xi ^{(0)}_t \mathsf {g_{0}}(y, \gamma _t). \end{aligned} \end{aligned}$$
(7)
$$\begin{aligned} \begin{aligned} \xi ^{(1)}_{t+1} =&\xi ^{(1)}_{t} + \beta _{t} \varDelta \xi ^{(1)}_{t}(\mathsf {X}_{t+1}, J(\mathsf {X}_{t+1})),\\&\text { where }\varDelta \xi ^{(1)}_{j+1}(x, y) \triangleq \mathsf {g_{2}}(y, x, \gamma _t, \xi ^{(0)}_t) - \xi ^{(1)}_t \mathsf {g_{0}}(y, \gamma _t). \end{aligned} \end{aligned}$$
(8)

Here, \(\beta _t > 0\) and \(\mathsf {X}_{t+1} \sim \widehat{f}_{\theta _t}\), where the mixture PDF \(\widehat{f}_{\theta _t}\) is defined as \(\widehat{f}_{\theta _t} \triangleq (1-\zeta )f_{\theta _t} + \zeta f_{\theta _0}\), with \(\zeta \in (0,1)\) and \(f_{\theta _0}\) the initial PDF. The mixture facilitates extensive exploration of the solution space and prevents the model iterates from getting trapped in suboptimal solutions.
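To make the recursions concrete, the following is a minimal sketch of one step of (6)–(8) for the Gaussian family. The objective \(J\), the shaping function \(S\), and the values of \(\rho \), \(\zeta \) and the step size used below are placeholders chosen for illustration only; the slower-timescale update of \(\theta _t\) itself belongs to Algorithm 2 and is omitted here.

import numpy as np

def sa_ce_step(mu, Sigma, xi0, xi1, gamma_thr, J, rng,
               mu0=None, Sigma0=None, rho=0.1, zeta=0.3, beta=0.01,
               S=lambda y: np.exp(0.01 * y)):
    # One step of the SA recursions (6)-(8); J, S, rho, zeta and beta are
    # illustrative placeholders, not the paper's exact choices.
    mu0 = mu if mu0 is None else mu0
    Sigma0 = Sigma if Sigma0 is None else Sigma0

    # Sample X_{t+1} from the mixture PDF (1 - zeta) f_{theta_t} + zeta f_{theta_0}.
    if rng.random() < zeta:
        x = rng.multivariate_normal(mu0, Sigma0)
    else:
        x = rng.multivariate_normal(mu, Sigma)
    y = J(x)

    # Recursions (7) and (8): track Upsilon_1 and Upsilon_2 of Eq. (4),
    # using the current threshold gamma_t and the current tracker xi0.
    g0 = S(y) * float(y >= gamma_thr)                   # g_0(J(x), gamma_t)
    xi0_new = xi0 + beta * (g0 * x - g0 * xi0)          # Eq. (7)
    d = (x - xi0).reshape(-1, 1)
    xi1_new = xi1 + beta * (g0 * (d @ d.T) - g0 * xi1)  # Eq. (8)

    # Recursion (6): incremental tracking of the (1 - rho)-quantile of J.
    d_gamma = -(1.0 - rho) * float(y >= gamma_thr) + rho * float(y <= gamma_thr)
    gamma_new = gamma_thr - beta * d_gamma

    # The slower update of theta_t = (mu, Sigma) towards (xi0, xi1) is part
    # of Algorithm 2 and is omitted in this sketch.
    return xi0_new, xi1_new, gamma_new

Here rng is a NumPy random generator, e.g. rng = np.random.default_rng(0).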

2.2 Computing the Objective Function \(\mathbb {E}_{\nu _w}\left[ h_{w \vert w}\right] \)

In this paper, we employ the off-policy LSTD(\(\lambda \)) algorithm to approximate \(h_{w \vert w}\) for a given policy parameter \(w \in \mathbb {W}\). The procedure to estimate the objective function \(\mathbb {E}_{\nu _w}\left[ h_{w \vert w}\right] \) is formally defined in Algorithm 1. The Predict procedure in Algorithm 1 is essentially the off-policy LSTD(\(\lambda \)) algorithm. The recursion in step 9 estimates the objective function \(\mathbb {E}_{\nu _w}\left[ h_{w \vert w}\right] \) as follows:

$$\begin{aligned} \ell _{k+1}^{w} = \ell _{k}^{w} + \frac{1}{k}\Big (\mathbf {x}_{k}^{\top }\phi (\mathbf {s}_{k+1}) - \ell _{k}^{w}\Big ). \end{aligned}$$
(9)
[Algorithm 1: the Predict procedure]
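Since the listing of Algorithm 1 is not reproduced here, the following is a hedged sketch of one common form of such a Predict procedure: a per-decision importance-sampling variant of off-policy LSTD(\(\lambda \)) whose steady-state limits correspond to the matrices \(A_{w \vert w_b}\) and \(b_{w \vert w_b}\) of Lemma 1, combined with the averaging recursion (9). The function names, the small regularization term and the eligibility-trace convention are assumptions made for this sketch and need not coincide with the exact steps of Algorithm 1.

import numpy as np

def predict(path, phi, pi_w, pi_wb, k1, gamma=0.99, lam=0.5, eps=1e-6):
    # path  : finite list of transitions (s, a, r, s_next) from the behaviour policy
    # phi   : phi(s) -> feature vector in R^{k1}
    # pi_w  : pi_w(a, s)  target-policy probability
    # pi_wb : pi_wb(a, s) behaviour-policy probability
    A = np.zeros((k1, k1))
    b = np.zeros(k1)
    z = np.zeros(k1)          # eligibility trace
    ell = 0.0                 # objective estimate, recursion (9)
    rho_prev = 1.0

    for k, (s, a, r, s_next) in enumerate(path, start=1):
        rho = pi_w(a, s) / pi_wb(a, s)          # importance sampling ratio
        z = gamma * lam * rho_prev * z + phi(s)

        # Running averages whose limits are A_{w|w_b} and b_{w|w_b} of Lemma 1.
        A += (np.outer(z, phi(s) - gamma * rho * phi(s_next)) - A) / k
        b += (rho * r * z - b) / k

        # Current LSTD solution x_k (eps-regularized for early iterations,
        # an assumption of this sketch) and the averaging step (9).
        x = np.linalg.solve(A + eps * np.eye(k1), b)
        ell += (x @ phi(s_next) - ell) / k

        rho_prev = rho

    return ell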

For a given \(w \in \mathbb {W}\), \(\ell _{k}^{w}\) attempts to find an approximate value of the objective function J(w). The following lemma formally characterizes the limiting behaviour of the iterates \(\ell _{k}^{w}\).

Lemma 1

For a given \(w \in \mathbb {W}, \ell _{k}^{w} \rightarrow \ell ^{w}_{*} = \mathbb {E}_{\nu _{w_b}}\left[ x_{w \vert w_b}^{\top }\phi (\mathbf {s})\right] \) as \(k \rightarrow \infty \) w.p. 1, where

$$\begin{aligned} \begin{aligned} x_{w | w_b} =&A^{-1}_{w \vert w_b} b_{w \vert w_b} \, with \, A_{w \vert w_b} = \varPhi ^{\top }D^{\nu _{w_b}}(\mathbb {I}-\gamma \lambda \mathrm {P}_{w})^{-1}(\mathbb {I}-\gamma \mathrm {P}_{w})\varPhi \\ \, and \,&b_{w \vert w_b} = \varPhi ^{\top }D^{\nu _{w_b}}(\mathbb {I}-\gamma \lambda \mathrm {P}_{w})^{-1}\mathrm {R}^{w}. \end{aligned} \end{aligned}$$
(10)

Here \(D^{\nu _{w_b}}\) is the diagonal matrix with \(D^{\nu _{w_b}}_{ii}=\nu _{w_b}(i)\), \(1 \le i \le \vert \mathbb {S}\vert \), where \(\nu _{w_b}\) is the stationary distribution of the Markov chain \(\mathrm {P}_{w_b}\) induced by the behaviour policy \(\pi _{w_b}\), and \(\mathrm {R}^{w} \in \mathbb {R}^{\vert \mathbb {S} \vert }\) with \(\mathrm {R}^{w}(s) \triangleq \sum _{s^{\prime } \in \mathbb {S}, a \in \mathbb {A}}\pi _{w}(a \vert s)\mathrm {P}(s, a, s^{\prime })\mathrm {R}(s, a, s^{\prime })\).

Remark 1

By the above lemma, for a given \(w \in \mathbb {W}\), the quantity \(\ell _{k}^{w}\) tracks \(J_b(w) \triangleq \mathbb {E}_{\nu _{w_b}}\left[ x^{\top }_{w \vert w_b}\phi (\mathbf {s})\right] \). This is, however, different from the true objective function value \(J(w) = \mathbb {E}_{\nu _{w}}\left[ h_{w \vert w}\right] \) when \(w \ne w_b\). This additional approximation error is the price one pays for the discrepancy between the target policy and the behaviour policy that generated the sample path.

2.3 Proposed Algorithm

Our algorithm for solving the control problem (2) is given in Algorithm 2. The following theorem shows that the model sequence \(\{\theta _{t}\}\) and the averaged sequence \(\{\overline{\theta }_{t}\}\) generated by Algorithm 2 converge to the degenerate distribution concentrated at a global maximum of the surrogate objective function \(J_b\).

[Algorithm 2: the proposed CE-based control algorithm]
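For orientation, the overall structure of the method can be sketched as follows, reusing the sa_ce_step sketch from Sect. 2.1 and assuming a Predict-style estimator J_hat of \(J_b(w)\) is available. The step-size schedules, the initial covariance and the omission of the threshold sequence \(c_t\), the averaging sequence \(\overline{\theta }_t\) and the projection onto \(\varTheta \) are simplifications made for this sketch; the exact update rules are those of Algorithm 2.

import numpy as np

def algorithm2_sketch(J_hat, k2, num_iters=5000, rho=0.1, zeta=0.3, q=25.0, seed=0):
    # Schematic two-timescale loop: the SA-CE recursions (6)-(8) are driven by
    # the estimate J_hat(w) of J_b(w) returned by a Predict-style procedure.
    # Assumes sa_ce_step from the sketch in Sect. 2.1 is in scope; step sizes
    # and constants are illustrative placeholders.
    rng = np.random.default_rng(seed)
    mu = np.zeros(k2)
    Sigma = q * np.eye(k2)                 # theta_0 = (mu_0, q I), cf. Theorem 1
    mu0, Sigma0 = mu.copy(), Sigma.copy()
    xi0, xi1 = mu.copy(), Sigma.copy()
    gamma_thr = 0.0

    for t in range(1, num_iters + 1):
        beta = 1.0 / t ** 0.6              # faster timescale beta_t
        beta_bar = 1.0 / t                 # slower timescale, beta_bar_t = o(beta_t)

        xi0, xi1, gamma_thr = sa_ce_step(mu, Sigma, xi0, xi1, gamma_thr,
                                         J_hat, rng, mu0=mu0, Sigma0=Sigma0,
                                         rho=rho, zeta=zeta, beta=beta)

        # Slow update of the model parameter towards the tracked quantities
        # (the comparison with c_t and the projection onto a compact Theta,
        # used in the actual Algorithm 2, are omitted here).
        mu = mu + beta_bar * (xi0 - mu)
        Sigma = Sigma + beta_bar * (xi1 - Sigma)

    return mu, Sigma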

Theorem 1

Let \(S(x) = \exp (rx)\), \(r \in \mathbb {R}\). Let \(\rho , \zeta \in (0,1)\) and \(\theta _0 = (\mu _0, q\mathbb {I}_{k_2 \times k_2})^{\top }\), where \(q \in \mathbb {R}_{+}\). Also, let \(c_t \rightarrow 0\) as \(t \rightarrow \infty \). Let the learning rates \(\overline{\beta }_{t}\) and \(\beta _{t}\) satisfy \(\sum _t \beta _t = \sum _t \overline{\beta }_t = \infty \), \(\sum _t \big (\beta ^{2}_t + \overline{\beta }^{2}_t\big ) < \infty \) and \(\overline{\beta }_{t} = o(\beta _{t})\). Let \(\{\theta _{t} = (\mu _{t}, \varSigma _{t})\}_{t \in \mathbb {N}}\) and \(\{\overline{\theta }_{t} = (\overline{\mu }_{t}, \overline{\varSigma }_{t})\}_{t \in \mathbb {N}}\) be the sequences generated by Algorithm 2 and assume \(\theta _{t} \in \varTheta \), \(\forall t \in \mathbb {N}\). Let \(w_b \in \mathbb {W}\) be the chosen behaviour policy vector, and let the assumptions (A1–A2) hold. Then there exist \(q^{*} \in \mathbb {R}_{+}\) and \(r^{*} \in \mathbb {R}_{+}\) s.t. \(\forall q > q^{*}\) and \(\forall r > r^{*}\),

$$\begin{aligned} \theta _{t} \rightarrow (w^{b*}, 0_{k_2 \times k_2})^{\top } ,\overline{\theta }_{t} \rightarrow (w^{b*}, 0_{k_2 \times k_2})^{\top } \quad as \,\, t \rightarrow \infty , w.p.1, \end{aligned}$$
(13)

where \(w^{b*} \in \mathop {\arg \mathrm{max}}\nolimits _{w \in \mathbb {W}} J_b(w)\) with \(J_b(w) \triangleq \mathbb {E}_{\nu _{w_b}}\left[ x^{\top }_{w \vert w_b}\phi (\mathbf {s})\right] \).

Fig. 1. Chain walk MDP

3 Experimental Illustrations

The performance of our algorithm is evaluated on the chain walk MDP setting. This setting, proposed in [3], demonstrates a scenario where policy iteration fails to converge when approximate value functions are employed instead of exact ones. Here \(\vert \mathbb {S} \vert = 300\), \(\mathbb {A} = \{L, R\}\), \(k_1=5\), \(k_2=10\) and the discount factor \(\gamma = 0.99\). The reward function is given by \(R(\cdot , \cdot , 100) = R(\cdot , \cdot , 200) = 1.0\), and is zero for all other transitions. The transition probability kernel is given by \(P(s, L, s+1) = 0.1\), \(P(s, L, s-1) = 0.9\), \(P(s, R, s+1) = 0.9\) and \(P(s, R, s-1) = 0.1\). We choose the behaviour policy vector \(w_{b} = (0,0,\dots ,0)^{\top }\). We employ the Gibbs “soft-max” class of policies: \(\pi _{w}(a | s) = \frac{e^{(w^{\top }\psi (s, a)/\tau )}}{\sum _{b \in \mathbb {A}}e^{(w^{\top }\psi (s,b)/\tau )}}\), where \(\{\psi (s, a) \in \mathbb {R}^{k_{2}} \vert s \in \mathbb {S}, a \in \mathbb {A}\}\) is a given policy feature set and \(\tau \in \mathbb {R}_{+}\) is fixed a priori. We employ radial basis functions (RBF) as both policy and prediction features, i.e.,

Fig. 2. Plots of the optimal value functions obtained by LSPI, Greedy-GQ and Algorithm 2 for the chain walk MDP setting. The reported solutions are averaged over 10 independent trials. For Algorithm 2, we averaged the optimal solutions obtained for different sample trajectories generated using the same behaviour policy but with different, randomly chosen initial states. Our approach (Algorithm 2) clearly surpassed the other algorithms in terms of solution quality. The random choice of the initial state effectively ensured sufficient exploration of the state space, which in turn helped generate high-quality solutions.

[RBF feature definitions with centres \(m_i\) and widths \(v_i\)]

where \(m_i = 5+10(i-1), v_i = 5\), \(1 \le i \le 5\) (Fig. 1).
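For concreteness, the following sketch reconstructs the experimental setup under stated assumptions: the exact RBF form, the feature layout of \(\psi (s,a)\), the state indexing, the boundary behaviour of the chain and the value of \(\tau \) are not specified above and are chosen here purely for illustration.

import numpy as np

N_STATES, K1, K2, GAMMA, TAU = 300, 5, 10, 0.99, 1.0
m = np.array([5.0 + 10.0 * (i - 1) for i in range(1, 6)])   # centres m_i
v = 5.0                                                     # width v_i

def phi(s):
    # Prediction features: Gaussian RBFs over the state (assumed form).
    return np.exp(-((s - m) ** 2) / (2.0 * v ** 2))

def psi(s, a):
    # Policy features in R^{k2}: one RBF block per action (assumed layout).
    feat = np.zeros(K2)
    offset = 0 if a == 'L' else K1
    feat[offset:offset + K1] = phi(s)
    return feat

def pi(w, a, s, tau=TAU):
    # Gibbs "soft-max" policy pi_w(a | s).
    prefs = np.array([w @ psi(s, b) / tau for b in ('L', 'R')])
    prefs -= prefs.max()                                    # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return probs[0] if a == 'L' else probs[1]

def step(s, a, rng):
    # Chain walk dynamics: the intended direction succeeds w.p. 0.9;
    # transitions into states 100 and 200 yield reward 1.0.
    move = rng.choice([-1, +1], p=[0.9, 0.1] if a == 'L' else [0.1, 0.9])
    s_next = int(min(max(s + move, 0), N_STATES - 1))       # boundary handling assumed
    r = 1.0 if s_next in (100, 200) else 0.0
    return s_next, r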

The results are shown in Fig. 2.

4 Conclusion

We propose an adaptation of the cross entropy method to solve the control problem in reinforcement learning under an information-restricted setting, where only a single sample path generated using an a priori chosen behaviour policy is available. The proposed algorithm is shown to converge to a solution that is globally optimal relative to the behaviour policy.