MDPs and Reinforcement Learning
CSCI 4511/6511
Announcements
- Homework Three:
- Turn in for 75% credit max through 25 Apr (hard deadline)
- Homework Four: Today
- Project Scope: 6 Apr
Markov Chains
Markov property:
P(X_t | X_{t-1}, X_{t-2}, \ldots, X_0) = P(X_t | X_{t-1})
“The future only depends on the past through the present.”
- State X_{t-1} captures “all” information about the past
- No information in X_{t-2} (or other past states) influences X_t
Markov Reward Process
- Reward function R_s = E[R_{t+1} | S_t = s]:
- Reward for being in state s
- Discount factor \gamma \in [0, 1]
U_t = \sum\limits_k \gamma^k R_{t+k+1}
The Markov Decision Process
- Transition probabilities depend on actions
Markov Process:
s_{t+1} = s_t P
Markov Decision Process (MDP):
s_{t+1} = s_t P_a
Rewards: R_a with discount factor \gamma
MDP - Policies
- Agent function
- Actions conditioned on states
\pi(a | s) = P[A_t = a | S_t = s]
- Can be stochastic
- Usually deterministic
- Usually stationary
MDP - Policies
State value function U^\pi:
U^\pi(s) = E_\pi[U_t | S_t = s]
State-action value function Q^\pi:
Q^\pi(s, a) = E_\pi[U_t | S_t = s, A_t = a]
Notation: E_\pi indicates expected value under policy \pi
Bellman Expectation
Value function:
U^\pi(s) = E_\pi[R_{t+1} + \gamma U^\pi(S_{t+1}) | S_t = s]
Action-value function:
Q^\pi(s, a) = E_\pi[R_{t+1} + \gamma Q^\pi(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
Understanding these equations is the linchpin of all of MDPs
Value Function
U^\pi(s) = E_\pi[R_{t+1} + \gamma U^\pi(S_{t+1}) | S_t = s]
State-Action Value Function
Q^\pi(s, a) = E_\pi[R_{t+1} + \gamma Q^\pi(S_{t+1}, A_{t+1}) | S_t = s, A_t = a]
Policy Evaluation
- How good is some policy π?
U^\pi_1(s) = R(s, \pi(s))
U^\pi_{k+1}(s) = R(s, \pi(s)) + \gamma \sum\limits_{s'} T(s' | s, \pi(s)) U^\pi_k(s')
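A minimal Python sketch of this iteration for a tabular MDP; the names evaluate_policy, T, R, and policy are illustrative assumptions, not from the slides:

    def evaluate_policy(T, R, policy, gamma=0.9, sweeps=100):
        # Iterative policy evaluation for a tabular MDP (sketch).
        # T[s][a][s2]: transition probability T(s2 | s, a)
        # R[s][a]:     expected reward for taking action a in state s
        # policy[s]:   deterministic action pi(s)
        n = len(T)
        U = [0.0] * n
        for _ in range(sweeps):
            U = [R[s][policy[s]] + gamma * sum(T[s][policy[s]][s2] * U[s2]
                                               for s2 in range(n))
                 for s in range(n)]
        return U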
Optimal Policies 😌
- There will always be an optimal policy
- Policy ordering: \pi \geq \pi' \iff U^\pi(s) \geq U^{\pi'}(s), \forall s
- Optimal policy:
- \pi^* \geq \pi, \forall \pi
- U^{\pi^*}(s) = U^*(s) and Q^{\pi^*}(s, a) = Q^*(s, a)
Optimal Policies
- Optimal policy \pi^* maximizes expected utility from state s:
- State value function: U^*(s) = \max\limits_\pi U^\pi(s)
Optimal Policies
- State-action value function:
- Q(s, a) = R(s, a) + \gamma \sum \limits_{s'} T(s' | s, a) U(s')
- Greedy policy given some U(s):
- \pi(s) = \mathop{\operatorname{arg\,max}}\limits_a Q(s,a)
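A sketch of extracting the greedy policy from a value estimate U via one-step lookahead; the function name and the tabular T, R structures are assumptions for illustration:

    def greedy_policy(U, T, R, gamma=0.9):
        # One-step lookahead: compute Q(s, a) from U(s'), then act greedily.
        n_states, n_actions = len(T), len(T[0])
        policy = []
        for s in range(n_states):
            q = [R[s][a] + gamma * sum(T[s][a][s2] * U[s2] for s2 in range(n_states))
                 for a in range(n_actions)]
            policy.append(q.index(max(q)))   # arg max over actions
        return policy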
Partial Bellman Equation
Decision: U^*(s) = \max_a Q^*(s,a)
Partial Bellman Equation
Stochastic: Q^*(s, a) = R(s, a) + \gamma \sum \limits_{s'} T(s' | s, a) U^*(s')
Bellman Equation
U^*(s) = \max\limits_a \left[ R(s, a) + \gamma \sum \limits_{s'} T(s' | s, a) U^*(s') \right]
Bellman Equation
Q^*(s, a) = R(s, a) + \gamma \sum \limits_{s'} T(s' | s, a) \max\limits_{a'} Q^*(s', a')
How To Solve It
- No closed-form solution
- Optimal case differs from policy evaluation
Iterative Solutions:
- Value Iteration
- Policy Iteration
Reinforcement Learning
Dynamic Programming
- Assumes full knowledge of MDP
- Decompose problem into subproblems
- Bellman Equation: recursive decomposition
- Value function caches solutions
Iterative Policy Evaluation
Iteratively, for each algorithm step k:
U_{k+1}(s) = \sum \limits_{a \in A} \pi(a | s) \left(R(s, a) + \gamma \sum \limits_{s'\in S} T(s' | s, a) U_k(s') \right)
- Deterministic policy: only one action per state
Policy Iteration
Algorithm:
- Until convergence:
- Evaluate policy
- Select new policy according to greedy strategy
Greedy strategy:
\pi'(s) = \mathop{\operatorname{arg\,max}}\limits_a Q(s,a)
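A sketch of the loop, reusing the evaluate_policy and greedy_policy sketches from earlier (an illustration of the structure, not a definitive implementation): evaluate, improve greedily, stop when the policy no longer changes.

    def policy_iteration(T, R, gamma=0.9):
        # Alternate policy evaluation and greedy improvement until stable.
        policy = [0] * len(T)                # arbitrary initial policy
        while True:
            U = evaluate_policy(T, R, policy, gamma)
            improved = greedy_policy(U, T, R, gamma)
            if improved == policy:           # greedy choice unchanged everywhere
                return policy, U
            policy = improved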
Unpacking the Notation
\pi(s) = \mathop{\operatorname{arg\,max}}\limits_a Q(s,a)
- \pi'(s)
- New policy \pi'
- New policy is a function of state: \pi'(s)
- Q(s,a)
- Value of the state-action pair (s, a)
- Policy as function of state s
- Looks over all actions at each state
- Chooses action with highest value (argmax)
Policy Iteration
Previous step:
Current step:
- Q^\pi(s, \pi'(s)) = \max \limits_a Q^\pi(s,a) \geq Q^\pi(s, \pi(s))
Policy Iteration
Convergence:
- Q^\pi(s, \pi' (s)) = \max \limits_a Q^\pi(s,a) = Q^\pi(s, \pi(s))
Convergence
- Does policy evaluation need to converge to U^\pi ?
- U^\pi represents value
- We care about the policy
Modified Policy Iteration:
- \epsilon-convergence
- k-iteration policy evaluation
- k = 1: Value Iteration
Value Iteration
Optimality:
- From state s, a set of successor states s' is reachable
- The optimal policy \pi^*(s) achieves the optimal value:
Assume:
- We have U^*(s')
- We want U^*(s)
Value Iteration
One-step lookahead:
U^*(s) \gets \max \limits_a \left( R(s,a) + \gamma \sum \limits_{s'} T(s'|s, a) U^*(s') \right)
- Apply updates iteratively
- Use current U(s') as “approximation” for U^*(s')
- That’s it. That’s the algorithm.
- Extract policy from values after completion.
Synchronous Value Iteration…
U_{k+1}(s) \gets \max \limits_a \left( R(s,a) + \gamma \sum \limits_{s'} T(s'|s, a) U_k(s') \right)
- U_k(s) held in memory until U_{k+1}(s) computed
- Effectively requires two copies of U
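A minimal sketch of synchronous value iteration with \epsilon-convergence as the stopping test; names and structure are illustrative, not from the slides:

    def value_iteration(T, R, gamma=0.9, eps=1e-6):
        # Synchronous sweeps: U_k is kept intact until U_{k+1} is fully computed.
        n_states, n_actions = len(T), len(T[0])
        U = [0.0] * n_states
        while True:
            U_new = [max(R[s][a] + gamma * sum(T[s][a][s2] * U[s2]
                                               for s2 in range(n_states))
                         for a in range(n_actions))
                     for s in range(n_states)]
            if max(abs(u1 - u0) for u1, u0 in zip(U_new, U)) < eps:
                return U_new             # epsilon-convergence of the value estimate
            U = U_new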
Asynchronous Value Iteration
- Updating U(s) for one state at a time:
U(s) \gets \max \limits_a \left( R(s,a) + \gamma \sum \limits_{s'} T(s'|s, a) U(s') \right)
- Ordering of states can vary
- Converges if all states are updated
- …and if algorithm runs infinitely
Multi-Armed Bandits
- Slot machine with more than one arm
- Each pull has a cost
- Each pull has a payout
- Probability of payouts unknown
- Goal: maximize reward
Solving Multi-Armed Bandits
😔
Confidence Bounds
- Expected value of reward per arm
- Confidence interval of reward per arm
- Select arm based on upper confidence bound
- How do we estimate rewards?
Bandit Strategies
- Upper Confidence Bound for arm M_i:
- UCB(M_i) = \mu_i + \frac{g(N)}{\sqrt{N_i}}
- g(N) is an exploration function of the total pull count N; its growth rate determines the regret bound
- Thompson Sampling
- Sample arm based on probability of being optimal
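A sketch of UCB-style arm selection, assuming the common UCB1 choice g(N) = \sqrt{2 \ln N} (so the bonus is \sqrt{2 \ln N / N_i}); unpulled arms are tried first. This is an illustration, not the only valid g(N).

    import math

    def ucb_arm(means, counts):
        # means[i]: empirical mean reward of arm i; counts[i]: pulls of arm i.
        N = sum(counts)
        def ucb(i):
            if counts[i] == 0:
                return float('inf')      # try every arm at least once
            return means[i] + math.sqrt(2 * math.log(N) / counts[i])
        return max(range(len(means)), key=ucb)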
Tree Search
- Forget DFS, BFS, Dijkstra, A*
- State space too large
- Stochastic expansion
- Impossible to search entire tree
- Can simulate problem forward in time from starting state
Monte Carlo Tree Search
- Randomly simulate trajectories through tree
- Complete trajectory
- No heuristic needed
- Need a model
- Better than exhaustive search?
Selection Policy
- Focus search on “important” parts of tree
- Similar to alpha-beta pruning
- Explore vs. exploit
- Simulation
- Not actually exploiting the problem
- Exploiting the search
Monte Carlo Tree Search
- Choose a node
- Explore/exploit
- Choose a successor
- Continue to leaf of search tree
- Expand leaf node
- Simulate result until completion
- Back-propagate results to tree
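A high-level sketch of that loop; the callables passed in are placeholders for the phases above (assumptions, not a fixed API):

    def mcts(root, n_simulations, select, expand, rollout, backpropagate, best_action):
        # Generic MCTS loop (sketch): repeat the four phases, then commit.
        for _ in range(n_simulations):
            leaf = select(root)          # walk the tree, balancing explore/exploit
            child = expand(leaf)         # add a successor node to the tree
            value = rollout(child)       # simulate to completion with a default policy
            backpropagate(child, value)  # push the result back up to the root
        return best_action(root)         # e.g. most-visited action at the root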
Monte Carlo Tree Search
Selection/Search
Monte Carlo Tree Search
Expansion
Monte Carlo Tree Search
Simulation/Rollout
Monte Carlo Tree Search
Back-Propagation
Upper Confidence Bounds for Trees (UCT)
- MDP: Maximize Q(s, a) + c\sqrt{\frac{\log{N(s)}}{N(s,a)}}
- Q for state s and action a
- POMDP: Maximize Q(h, a) + c\sqrt{\frac{\log{N(h)}}{N(h,a)}}
- Q for history h and action a
- History: action/observation sequence
- c is exploration bonus
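A sketch of UCT selection for the MDP case, assuming Q and N are stored as dictionaries keyed by (s, a) and N(s) is the sum of N(s, a) over actions; untried actions are selected first:

    import math

    def uct_select(Q, N, s, actions, c=1.0):
        # Pick the action maximizing Q(s, a) + c * sqrt(log N(s) / N(s, a)).
        n_s = sum(N.get((s, a), 0) for a in actions)
        def score(a):
            n_sa = N.get((s, a), 0)
            if n_sa == 0:
                return float('inf')      # untried actions first
            return Q.get((s, a), 0.0) + c * math.sqrt(math.log(n_s) / n_sa)
        return max(actions, key=score)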
UCT Search - Algorithm
Mykel Kochenderfer. Decision Making Under Uncertainty, MIT Press 2015
Monte Carlo Tree Search - Search
Monte Carlo Tree Search - Expansion
- State \notin T
- Initialize N(s,a) and Q(s,a)
- Add state to T
Monte Carlo Tree Search - Rollout
- Policy \pi_0 is “rollout” policy
- Usually stochastic
- States not tracked
Erstwhile
- States
- Actions
- Transition model between states, based on actions
- Known rewards
Model Uncertainty
- No model of transition dynamics
- No initial knowledge of rewards
😣
We can learn these things!
Model Uncertainty
Action-value function:
Q(s, a) = R(s, a) + \gamma \sum \limits_{s'} T(s' | s, a) U(s')
If we don’t know T:
U^\pi(s) = E_\pi \left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + ...|s \right]
Q(s, a) = E_\pi \left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + ...|s,a \right]
Temporal Difference (TD) Learning
- Take action from state, observe new state, reward
U(s) \gets U(s) + \alpha \left[ r + \gamma U(s') - U(s)\right]
Update immediately given (s, a, r, s')
TD Error: \left[ r + \gamma U(s') - U(s)\right]
- Measurement: r + \gamma U(s')
- Old Estimate: U(s)
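A minimal sketch of one TD(0) update, with U stored as a dictionary of value estimates (names are illustrative):

    def td0_update(U, s, r, s_next, alpha=0.1, gamma=0.95):
        # One TD(0) update after observing (s, r, s').
        td_error = r + gamma * U.get(s_next, 0.0) - U.get(s, 0.0)
        U[s] = U.get(s, 0.0) + alpha * td_error
        return td_error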
Q-Learning
- U^\pi gives us utility
- Solving for U^\pi allows us to pick a new policy
State-action value function: Q(s,a)
- \mathop{\operatorname{arg\,max}}\limits_a Q(s,a) provides the optimal policy
- Goal: Learn Q(s,a)
Q-Learning
Iteratively update Q:
Q(s,a) \gets Q(s,a) +\alpha \left[R + \gamma \max \limits_{a'} Q(s', a') - Q(s,a)\right]
- Current state s and action a
- Next state s', next action(s) a'
- Reward R
- Discount rate \gamma
- Learning rate \alpha
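A sketch of a single Q-learning update, with Q stored as a dictionary keyed by (state, action); names are illustrative:

    def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
        # Off-policy target: the best action value in the next state.
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))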
Q-Learning Algorithm
Mykel Kochenderfer. Decision Making Under Uncertainty, MIT Press 2015
Sarsa
Q-Learning: Q(s,a) \gets Q(s,a) +\alpha \left[R + \gamma \max \limits_{a'} Q(s', a') - Q(s,a)\right]
Sarsa:
Q(s,a) \gets Q(s,a) +\alpha \left[R + \gamma Q(s', a') - Q(s,a)\right]
Differences?
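For comparison, a sketch of the Sarsa update; the only change from the Q-learning sketch is that the target uses the action a' actually taken in s':

    def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
        # On-policy target: the value of the (s', a') pair the agent visits next.
        target = r + gamma * Q.get((s_next, a_next), 0.0)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))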
Q-Learning vs. Sarsa
- Sarsa is “on-policy”
- Evaluates state-action pairs taken
- Updates policy every step
- Q-learning is “off-policy”
- Evaluates “optimal” actions for future states
- Updates policy every step
Exploration vs. Exploitation
- Consider only the goal of learning the optimal policy
- Always picking the “optimal” action does not explore
- Picking randomly does not check “best” actions
- \epsilon-greedy:
- With probability \epsilon, choose random action
- With probability 1-\epsilon, choose ‘best’ action
- \epsilon need not be fixed
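A sketch of \epsilon-greedy action selection over a tabular Q stored as a dictionary keyed by (s, a); names are illustrative:

    import random

    def epsilon_greedy(Q, s, actions, epsilon=0.1):
        # With probability epsilon take a random action, otherwise the greedy one.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q.get((s, a), 0.0))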
Eligibility Traces
- Q-learning and Sarsa both propagate Q-values slowly
- Each update touches only a single state-action pair
- Recall MCTS:
- (Also recall that MCTS needs a generative model)
Eligibility Traces
- Keep track of what state-action pairs agent has seen
- Include future rewards in past Q-values
- Very useful for sparse rewards
- Can be more efficient for non-sparse rewards
Eligibility Traces
- Keep N(s,a): “number of times visited”
- Take action a_t from state s_t:
- N(s_t,a_t) \gets N(s_t,a_t) + 1
- Every time step, for all (s, a):
- \delta = R + \gamma Q(s',a') - Q(s,a)
- Q(s,a) \gets Q(s,a) + \alpha \delta N(s,a)
- N(s,a) \gets \gamma \lambda N(s,a)
- Discount factor \gamma
- Time decay \lambda
Sarsa-\lambda
Sarsa:
- Q(s,a) \gets Q(s,a) +\alpha \left[R + \gamma Q(s', a') - Q(s,a)\right]
Sarsa-\lambda:
- \delta = R + \gamma Q(s',a') - Q(s,a)
- Q(s,a) \gets Q(s,a) + \alpha \delta N(s,a)
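A sketch of one Sarsa(\lambda) step, with Q and the eligibility counts N stored as dictionaries keyed by (s, a); every pair with a non-zero trace shares in the current TD error. Names are illustrative.

    def sarsa_lambda_update(Q, N, s, a, r, s_next, a_next,
                            alpha=0.1, gamma=0.95, lam=0.9):
        # Accumulating eligibility traces spread the TD error back along the trajectory.
        N[(s, a)] = N.get((s, a), 0.0) + 1.0
        delta = r + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s, a), 0.0)
        for key in list(N):
            Q[key] = Q.get(key, 0.0) + alpha * delta * N[key]
            N[key] *= gamma * lam        # decay every trace
        return delta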
Sarsa-\lambda
Q-\lambda ?
Q-Learning:
\quad \quad Q(s,a) \gets Q(s,a) +\alpha \left[R + \gamma \max \limits_{a'} Q(s', a') - Q(s,a)\right]
Sarsa:
\quad \quad Q(s,a) \gets Q(s,a) +\alpha \left[R + \gamma Q(s', a') - Q(s,a)\right]
Sarsa-\lambda:
\quad \quad \delta = R + \gamma Q(s',a') - Q(s,a) \quad \quad Q(s,a) \gets Q(s,a) + \alpha \delta N(s,a)
Watkins Q-\lambda
Idea: only keep state-action pairs in N(s,a) that the greedy policy would have visited
Some actions are greedy: \max \limits_{a'} Q(s, a')
Some are random
On a random (non-greedy) action, reset N(s,a)
Why the difference from Sarsa?
Approximation Methods
- Large problems:
- Continuous state spaces
- Very large discrete state spaces
- Learning algorithms can’t visit all states
- Assumption: “close” states \rightarrow similar state-action values
Local Approximation
- Store Q(s,a) for a limited number of states: \theta(s,a)
- Weighting function \beta
- Maps true states to states in \theta
Q(s,a) = \theta^T\beta(s,a)
Update step:
\theta \gets \theta + \alpha \left[R + \gamma \theta^T \beta(s', a') - \theta^T\beta(s, a)\right] \beta(s, a)
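A sketch of one such update, assuming \theta is a NumPy weight vector and beta(s, a) returns the feature/weighting vector for a state-action pair (both are assumptions for illustration):

    import numpy as np

    def linear_q_update(theta, beta, s, a, r, s_next, a_next, alpha=0.01, gamma=0.95):
        # One update of linear-approximation Q-learning: Q(s, a) = theta . beta(s, a).
        phi, phi_next = beta(s, a), beta(s_next, a_next)
        td_error = r + gamma * theta @ phi_next - theta @ phi
        return theta + alpha * td_error * phi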
Linear Approximation Q-Learning
Mykel Kochenderfer. Decision Making Under Uncertainty, MIT Press 2015
Example: Grid Interpolations
References
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. 2nd Edition, 2018.
Mykel Kochenderfer, Tim Wheeler, and Kyle Wray. Algorithms for Decision Making. 1st Edition, 2022.
John G. Kemeny and J. Laurie Snell. Finite Markov Chains. 1st Edition, 1960.
Stanford CS234 (Emma Brunskill)
UCL Reinforcement Learning (David Silver)
Stanford CS228 (Mykel Kochenderfer)