Partially observed Markov decision processes with binomial observations

https://doi.org/10.1016/j.orl.2013.01.005

Abstract

We consider partially observed Markov decision processes with control limits. We show analytically that the finite-horizon control limits need not be monotonic in (a) the time remaining and (b) the probability of obtaining a conforming unit. We also prove that the infinite-horizon control limit can be calculated by solving a finite set of linear equations.

Introduction

A production process can be in either a “Good” or a “Bad” state. The states form a Markov chain: if the process is in the good state while producing one unit (during one period), there is a constant probability that it will have deteriorated to the bad state while producing the next unit (during the next period). Once the process enters the bad state it remains there. Units produced in either state may end up conforming or defective. The probability of obtaining a conforming unit in the good state is larger than that of obtaining a conforming unit in the bad state.

A controller observes the process periodically over time. The true state is unobservable and can only be inferred from the quality of the output. At the beginning of each period, the controller must select one of two actions: CONTINUE (CON: do nothing) or REPLACE (REP: renew the system for a fixed cost). The objective is to maximize the expected present value of total future profits.

The problem above represents a discrete-time partially observed Markov decision process (POMDP). The fundamental idea is to base actions upon the probability that the system is in the good state. This “good state probability”, also referred to as the “information state”, is updated periodically, using Bayes’ formula.

POMDPs provide a powerful probabilistic tool for decision making. Structural results have been studied for over thirty years; see [1], [5], [14], [17], [22]. For several computational procedures and algorithms, see [5], [13], [15], [18], [21], [24], [25], [27] and the references therein. The extension of the model to more than two states and more than two actions was discussed in [14], [16], [17], [19].

Several authors have proposed applications of POMDPs to machine maintenance, machine replacement, and quality control; among them are [2], who considered a lot-sizing problem with inspection and non-rigid demand, and [9], who considered a lot-sizing problem with inspection and rigid demand. Applications in other domains may be found in [17], [26], [23]. Givon and Grosfeld-Nir [6] provided an application for optimal control of TV shows. Lane [12] analyzed a POMDP application for fishermen. Aviv and Pazgal [3] studied a pricing problem faced by sellers of fashion-like goods. Hu et al. [10] explored control policies in medical drug therapy. Kaelbling et al. [11] considered navigation scenarios. Smallwood and Sondik [20] looked at object detection scenarios. Ben-Zvi and Nickerson [4] investigated intruder detection strategies.

This paper establishes non-monotonicity properties of POMDPs. (1) We show analytically that the finite-horizon control limits, as a function of the time remaining, are not necessarily monotonic; (2) we show analytically that the control limits, as a function of the probability of obtaining a conforming unit, are not necessarily monotonic. In addition, we prove that the infinite-horizon control limit can be calculated by solving a finite set of linear equations.

Although the first two properties were suggested in previous studies (see, for example, [2], [7], [9]), only numerical examples were given, without any intuition as to why they occur. The lack of analytical insight into these peculiar properties, and the complexity of the numerical calculations mentioned above, motivated our study. We emphasize that researchers obtaining such numerical results might be tempted to distrust their calculations; an analytical treatment is therefore warranted.

Furthermore, the analytical formulas we provide can help test and compare existing POMDP algorithms. These formulas also make it easy to test the sensitivity of the solution to the problem parameters. In addition, we provide several numerical results; these are easy to replicate, since the calculations do not involve complex programs.


Preliminaries

A production process can be in either a “Good” or a “Bad” state. The true state is unobservable and can only be inferred from the quality of the output. Products are classified as either conforming or defective. The probability of obtaining a conforming unit in the Good (Bad) state is $\theta_0$ ($\theta_1$). That is, let $Y=0$ ($Y=1$) denote that a unit produced in state $Z$ is conforming (defective). Then $P(Y=0\,|\,Z=\mathrm{Good})=\theta_0$ and $P(Y=0\,|\,Z=\mathrm{Bad})=\theta_1$. Naturally, we assume that $\theta_0>\theta_1$.

The states are probabilistically related through a Markov chain: a process that is good while producing one unit remains good while producing the next with probability $r$; once the process enters the bad state it remains there.
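In particular, the one-step evolution of the information state $x$ (the probability that the process is in the good state) can be written out explicitly. The following is a sketch consistent with the BinB formulas of a later section, where $p(x,y)$ denotes the probability of observing output $y$ given information state $x$, and $h(x,y)$ the updated information state:

$$p(x,0)=x\theta_0+(1-x)\theta_1,\qquad p(x,1)=1-p(x,0),$$
$$h(x,0)=r\,\frac{x\theta_0}{x\theta_0+(1-x)\theta_1},\qquad h(x,1)=r\,\frac{x(1-\theta_0)}{x(1-\theta_0)+(1-x)(1-\theta_1)},$$

i.e., Bayes’ update of the good-state probability followed by one Markov transition.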

The model

We assume that the revenue during each period strictly increases in the quality of the output. We denote by $m_0$ and $m_1$ the (per-period) expected revenues in the good and the bad states, respectively; thus, $m_0>m_1$. The objective is to maximize the expected present value of the total future profits.

Let $n$ be the number of remaining periods. We denote by $U_n^{CON}(x)$ the expected present value of the total future profits if the current action is CONTINUE, the information state is $x$, and all future actions are optimal.
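To make the recursion concrete, here is a minimal backward-induction sketch on a grid of information states. It assumes the observation and update formulas sketched in the Preliminaries, revenues normalized so that the one-period expected revenue equals $x$ (i.e., $m_0=1$, $m_1=0$, matching the recursions below), and illustrative parameter values that are not taken from the paper:

    import numpy as np

    # Illustrative parameters (not from the paper)
    theta0, theta1 = 0.95, 0.40    # P(conforming | good), P(conforming | bad)
    r, alpha, k = 0.90, 0.95, 5.0  # good-state persistence, discount factor, replacement cost
    N = 25                         # horizon: number of remaining periods

    x = np.linspace(0.0, 1.0, 2001)  # grid of information states

    def p(xs, y):
        # Probability of observing y (0 = conforming, 1 = defective) given state xs.
        p0 = xs * theta0 + (1.0 - xs) * theta1
        return p0 if y == 0 else 1.0 - p0

    def h(xs, y):
        # Bayes update of the good-state probability, then one Markov transition.
        if y == 0:
            post = xs * theta0 / (xs * theta0 + (1.0 - xs) * theta1)
        else:
            post = xs * (1.0 - theta0) / (xs * (1.0 - theta0) + (1.0 - xs) * (1.0 - theta1))
        return r * post

    V = np.zeros_like(x)  # V_0(x) = 0: no periods remain
    for n in range(1, N + 1):
        cont = x + alpha * (p(x, 0) * np.interp(h(x, 0), x, V)
                            + p(x, 1) * np.interp(h(x, 1), x, V))
        rep = -k + np.interp(r, x, cont)  # replace: pay k, restart at information state r
        C_n = x[np.argmax(cont >= rep)]   # control limit: CONTINUE is optimal iff x >= C_n
        V = np.maximum(cont, rep)
        print(f"n = {n:2d}, control limit C_n = {C_n:.4f}")

Each pass prints the control limit $C_n$, so the dependence of $C_n$ on the time remaining can be examined directly.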

“Bad-in-Bad” (BinB)

We refer to the problem with $\theta_1=0$ as “Bad-in-Bad” (BinB): all units produced in the bad state end up defective (bad). Hence, after detecting a conforming unit, it is certain that the state was good. Using $\theta_1=0$, we have $p(x,0)=x\theta_0$, $p(x,1)=1-x\theta_0$, $h(x,0)=r$, and $h(x,1)=r x(1-\theta_0)/(1-x\theta_0)$.

Thus, (7) becomes $V_n^{CON}(x)=x+\alpha\left[x\theta_0 V_{n-1}(r)+(1-x\theta_0)V_{n-1}(h(x,1))\right]$, $n\ge 1$.

Note that $V_{n-1}(r)=k+V_{n-1}^{REP}$ and, for $x=0^{+}$, $V_{n-1}(h(x,1))=V_{n-1}^{REP}$, $n\ge 2$. That is, for small values of $x$, $V_n^{CON}(x)=\alpha V_{n-1}^{REP}+(1+\alpha k\theta_0)x$, $n\ge 2$.
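The small-$x$ expression follows by direct substitution into the BinB recursion:

$$V_n^{CON}(x)=x+\alpha\left[x\theta_0\left(k+V_{n-1}^{REP}\right)+(1-x\theta_0)V_{n-1}^{REP}\right]=\alpha V_{n-1}^{REP}+(1+\alpha k\theta_0)x.$$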

Infinite horizon

Taking $n\to\infty$ in (7), (8), we have
$$V^{CON}(x)=x+\alpha\left[p(x,0)V(h(x,0))+p(x,1)V(h(x,1))\right],\qquad(9)$$
$$V^{REP}=-k+V^{CON}(r).\qquad(10)$$

We refer to (9), (10) as the infinite-horizon optimality equations and to $V^{CON}(x)$ as the infinite-horizon value function. It follows from Theorem 1 and from Grosfeld-Nir [8] that $V^{CON}(x)$ is convex and strictly increasing, and that there is a unique control limit, $C$, such that CON is optimal if and only if $x\ge C$. The control limit $C$ is the root of $V^{CON}(x)=V^{REP}$.
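As a numerical cross-check, the infinite-horizon control limit can be approximated by value iteration on the optimality equations (a generic fixed-point approach, not the finite linear-system method developed below), reusing the grid and the functions from the finite-horizon sketch above:

    # Infinite-horizon value iteration; reuses x, p, h, alpha, k, r from above.
    V = np.zeros_like(x)
    for _ in range(100_000):
        cont = x + alpha * (p(x, 0) * np.interp(h(x, 0), x, V)
                            + p(x, 1) * np.interp(h(x, 1), x, V))
        rep = -k + np.interp(r, x, cont)
        V_new = np.maximum(cont, rep)
        if np.max(np.abs(V_new - V)) < 1e-10:
            break
        V = V_new
    C = x[np.argmax(cont >= rep)]  # CON is optimal iff x >= C
    print(f"infinite-horizon control limit C = {C:.4f}")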

Next, we show several cases where $C$ can be calculated by solving a finite set of linear equations.

Conclusions

The POMDP model uses recursive equations, making analytical insight hard to obtain. We provide an analytical analysis demonstrating some of the peculiar properties of the model. Our analytical formulas are particularly useful for testing and comparing existing POMDP algorithms. These formulas also make it easy to test the sensitivity of the solution to the problem parameters.

The binomial-observations model we analyzed is practical and valuable in its own right. Future research could develop further applications of this model.

References (27)

  • A. Grosfeld-Nir et al., Production with rigid demand and costly inspection, Nav. Res. Logist. (2007)
  • C. Hu et al., Comparison of some suboptimal control policies in medical drug therapy, Oper. Res. (1996)
  • D.E. Lane, A partially observable model of decision making by fishermen, Oper. Res. (1989)