Tech in T: depth + breadth‎ > ‎Math‎ > ‎Optimization‎ > ‎

Markov Decision Process

modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.  At each time step, the process is in some state s, and the decision maker may choose any action a that is available in state s. The process responds at the next time step by randomly moving into a new state s', and giving the decision maker a corresponding reward R_a(s,s').

The probability that the process moves into its new state
s' is influenced by the chosen action. given by the state transition function P_a(s,s'). Thus, the next state s' depends on the current state s and the decision maker's action a. But given s and a, it is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP possess the Markov property.

 if only one action exists for each state and all rewards are zero, a Markov decision process reduces to a Markov chain.

Ronald A. Howard's book, Dynamic Programming and Markov Processes, in 1960

A simple MDP with three states and two actions/
    From each state you go to an action and from that action you go to the next state with some probability. How to chose among actions to get maximum reward?

Find the policy for the decision maker (which action to take?):

 if only one action exists for each state and all rewards are zero, a Markov decision process reduces to a Markov chain.

Discounted Sum:

R_a_t is the reward of action a at time t when we are in state s_t and proceed to state s_(t+1) due to the action.
Where γ is the discount factor ( 0 ≤ γ < 1 ): when t → ∞, γ^t → 0
As time goes on less discount to stabilize the cumulative reward system. the first actions have the most rewards.
maximize the cumulative function of the random reward,

by linear programming mathematically or by dynamic programming as in algorithms.
by backtracking on dynamic programming

The standard family of algorithms to calculate this optimal policy requires storage for two arrays indexed by state: value V, which contains real values, and policy π which contains actions. At the end of the algorithm, π will contain the solution and V(s) will contain the discounted sum of the rewards to be earned (on average) by following that solution from state s.

The algorithm has the following two kinds of steps, which are repeated in some order for all the states until no further changes take place. They are

which means choose the action a that yileds best reward for us in terms of policy π, and calculate the corresponding value as follows in a backtracking fashion. from s' to s

Their order depends on the variant of the algorithm; one can also do them for all states at once or state by state, and more often to some states than others. As long as no state is permanently excluded from either of the steps, the algorithm will eventually arrive at the correct solution.