Announcements
- Final Exam: 25 Apr
- Project Milestone 1: 20 Apr
- Project Milestone 2: 27 Apr
- Homework Three:
- Turn in for 75% credit max through 25 Apr (hard deadline)
Final Exam
- In class 25 Apr (next week!)
- 120 minutes
- 14 sides of notes (8.5”x11” or A4)
- Final exam grade can replace midterm grade if higher
- Midterm exam grade will not replace final exam grade
- Calculators permitted
Reflex Agent
- Very basic form of agent function
- Percept \(\rightarrow\) Action lookup table
- Good for simple games
- Needs entire state space in table
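A minimal sketch of this idea: the agent function is literally a table from percepts to actions (the percepts and actions below are hypothetical, just to show the shape).

```python
# Reflex agent: a direct percept -> action lookup table.
# Percepts and actions here are made up for illustration.
REFLEX_TABLE = {
    "dirty": "suck",
    "clean-at-left": "move-right",
    "clean-at-right": "move-left",
}

def reflex_agent(percept):
    """Return the stored action; every reachable percept must appear in the table."""
    return REFLEX_TABLE[percept]

print(reflex_agent("dirty"))  # -> suck
```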
Partially-Observable State
- Most real-world problems
- Reflex agents fail
- Agent needs a belief state
State
What is the state space?
Search: Why?
- Fully-observed problem
- Deterministic actions and state
- Well defined start and goal
Other Applications
- Route planning
- Protein design
- Robotic navigation
- Scheduling
Not Included
- Uncertainty
- Adversary
- Cooperation
- Continuous state
Search Problem
Search problem includes:
- Start State
- State Space
- State Transitions
- Goal Test
State Space:

Actions & Successor States:

State Space Graph
Search Trees
Graph:

Tree:

Let’s Talk About Trees
- For any non-trivial problem, they’re big
- (Effective) branching factor
- Depth
- Graph and tree both too large for memory
- Successor function (graph)
- Expansion function (tree)
How To Solve It
Given:
- Starting node
- Goal test
- Expansion
Do:
- Expand nodes from start
- Test each new node for goal
- Expand new nodes
- If nothing left to expand, failure
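A minimal sketch of that loop, assuming the caller supplies `start`, an `is_goal` test, and an `expand` function (the names and the toy example are mine, not from the slides); a FIFO frontier gives breadth-first order.

```python
from collections import deque

def tree_search(start, is_goal, expand):
    """Expand nodes outward from the start until the goal test passes.

    Returns the path from start to a goal node, or None if the frontier
    empties (nothing left to expand: failure).
    """
    frontier = deque([[start]])              # frontier holds partial paths
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if is_goal(node):
            return path
        for successor in expand(node):       # expansion function generates children
            frontier.append(path + [successor])
    return None

# Toy example: reach 5 from 0, where each node n expands to n+1 and n+2.
print(tree_search(0, lambda n: n == 5, lambda n: [n + 1, n + 2]))
```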
Tree Search Algorithms
- BFS
- DFS
- UCS/Dijkstra
- A*
- Greedy searches
A* Search
- Include path-cost \(g(n)\)
- Complete (always)
- Optimal (sometimes)
- Painful \(O(b^m)\) time and space complexity
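A sketch of A* with a priority queue ordered on \(f(n) = g(n) + h(n)\); the graph and the (trivially admissible) zero heuristic are illustrative only, and with \(h = 0\) this degenerates to uniform-cost search.

```python
import heapq

def a_star(start, goal, neighbors, h):
    """A* search: `neighbors(n)` yields (successor, step_cost); `h(n)` estimates cost to goal."""
    frontier = [(h(start), 0.0, start, [start])]         # entries are (f, g, node, path)
    best_g = {start: 0.0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        for nxt, cost in neighbors(node):
            g2 = g + cost
            if g2 < best_g.get(nxt, float("inf")):       # keep only the cheapest known path
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None

# Illustrative weighted graph; h = 0 is admissible (and consistent) but uninformative.
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)], "C": [("D", 1)], "D": []}
print(a_star("A", "D", lambda n: graph[n], lambda n: 0))  # -> (3.0, ['A', 'B', 'C', 'D'])
```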
Choosing Heuristics
- Recall: \(h(n)\) estimates cost from \(n\) to goal
- Admissibility
- Consistency
Choosing Heuristics
- Admissibility
- Never overestimates cost from \(n\) to goal
- Cost-optimal!
- Consistency
- \(h(n) \leq c(n, a, n') + h(n')\)
- \(n'\) successors of \(n\)
- \(c(n, a, n')\) cost from \(n\) to \(n'\) given action \(a\)
Consistency
- Consistent heuristics are admissible
- The converse is not necessarily true
- With a consistent heuristic, A* first reaches each state along an optimal path
Weighted A* Search
- Greedy: \(f(n) = h(n)\)
- A*: \(f(n) = h(n) + g(n)\)
- Uniform-Cost Search: \(f(n) = g(n)\)
…
- Weighted A* Search: \(f(n) = W\cdot h(n) + g(n)\)
Iterative-Deepening A* Search
“IDA*” Search
- Similar to Iterative Deepening with Depth-First Search
- DFS uses depth cutoff
- IDA* uses \(h(n) + g(n)\) cutoff with DFS
- Once cutoff breached, new cutoff:
- Typically the smallest \(h(n) + g(n)\) value that exceeded the previous cutoff
- \(O(b^m)\) time complexity 😔
- \(O(d)\) space complexity 😌
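A compact sketch of that loop (interface names are mine): a cost-bounded DFS that reports the smallest \(f = g + h\) value it refused, which becomes the next cutoff.

```python
def ida_star(start, is_goal, neighbors, h):
    """Iterative-deepening A*: depth-first search bounded by an f = g + h cutoff."""

    def dfs(path, g, bound):
        node = path[-1]
        f = g + h(node)
        if f > bound:
            return f, None                     # cutoff breached: report this f value
        if is_goal(node):
            return f, list(path)
        next_bound = float("inf")
        for nxt, cost in neighbors(node):
            if nxt in path:                    # skip trivial cycles on the current path
                continue
            t, found = dfs(path + [nxt], g + cost, bound)
            if found is not None:
                return t, found
            next_bound = min(next_bound, t)    # smallest f that exceeded the cutoff
        return next_bound, None

    bound = h(start)
    while True:
        bound, found = dfs([start], 0, bound)
        if found is not None:
            return found
        if bound == float("inf"):
            return None                        # nothing left below any cutoff: failure

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)], "C": [("D", 1)], "D": []}
print(ida_star("A", lambda n: n == "D", lambda n: graph[n], lambda n: 0))  # -> ['A', 'B', 'C', 'D']
```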
Where Do Heuristics Come From?
- Intuition
- Relaxation
- The problem is constrained
- Remove the constraint
- Pre-computation
- Learning
Local Search
Uninformed/Informed Search:
- Known start, known goal
- Search for optimal path
Local Search:
- “Start” is irrelevant
- Goal is not known
- But we know it when we see it
- Search for goal
“Real-World” Examples
- Scheduling
- Layout optimization
- Portfolio management
- Others?
Hill-Climbing
- Objective function
- State space mapping
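A minimal sketch, assuming an explicit objective and a neighbour function over the state space (both invented for illustration): keep taking the best uphill move until none exists.

```python
def hill_climb(state, objective, neighbors):
    """Move to the best neighbour while it improves the objective; stop at a (possibly local) maximum."""
    while True:
        best = max(neighbors(state), key=objective)
        if objective(best) <= objective(state):   # no uphill move left
            return state
        state = best

# Toy example: maximise -(x - 3)^2 over the integers, stepping by +/- 1.
print(hill_climb(0, lambda x: -(x - 3) ** 2, lambda x: [x - 1, x + 1]))  # -> 3
```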
Variations
- Sideways moves
- Stochastic moves
- Random restarts
- If at first you don’t succeed, try again!
- Complete 😌
The Trouble with Local Maxima
- We don’t know that they’re local maxima
- Hill climbing is efficient
- Exhaustive search is complete
- But it’s exhaustive!
- Stochastic methods are ‘exhaustive’
Simulated Annealing
- Doesn’t actually have anything to do with metallurgy
- Search begins with high “temperature”
- Temperature decreases during search
- Next state selected randomly
- Improvements always accepted
- Non-improvements rejected stochastically
- Higher temperature, less rejection
- “Worse” result, more rejection
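A sketch of the acceptance rule, assuming we are maximising an objective: improvements are always taken, and a worse candidate is accepted with probability \(e^{\Delta/T}\), which shrinks as the temperature \(T\) falls (the schedule and toy problem are made up).

```python
import math
import random

def simulated_annealing(state, objective, neighbor, t0=1.0, cooling=0.995, steps=5000):
    """Always accept improvements; accept a worse move with probability exp(delta / T)."""
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(state)
        delta = objective(candidate) - objective(state)   # > 0 means improvement
        if delta > 0 or random.random() < math.exp(delta / temperature):
            state = candidate
        temperature *= cooling                            # temperature decreases during search
    return state

# Toy example: maximise -(x - 3)^2 with random +/- 1 moves.
random.seed(0)
print(simulated_annealing(0, lambda x: -(x - 3) ** 2, lambda x: x + random.choice([-1, 1])))
```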
Local Beam Search
Recall:
- Beam search keeps track of \(k\) “best” branches
Local Beam Search:
- Hill climbing search, keeping track of \(k\) successors
Simple Games
- Two-player
- Turn-taking
- Discrete-state
- Fully-observable
- Zero-sum
- This does some work for us!
Minimax
- Initial state \(s_0\)
- Actions(\(s\)) and To-move(\(s\))
- Result(\(s, a\))
- Is-Terminal(\(s\))
- Utility(\(s, p\))
More Than Two Players
- Two players, two values: \(v_A, v_B\)
- Zero-sum: \(v_A = -v_B\)
- Only one value needs to be explicitly represented
- \(> 2\) players:
- \(v_A, v_B, v_C ...\)
- Value scalar becomes \(\vec{v}\)
Minimax Efficiency
Pruning removes the need to explore the full tree.
- Max and Min nodes alternate
- Once one value has been found, we can eliminate parts of search
- Lower values, for Max
- Higher values, for Min
- Remember highest value (\(\alpha\)) for Max
- Remember lowest value (\(\beta\)) for Min
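A sketch of minimax with alpha-beta pruning, using the game interface from the Minimax slide passed in as functions (Utility here takes just the state, for brevity); \(\alpha\) remembers the best value found so far for Max, \(\beta\) for Min. The three-branch toy game is illustrative.

```python
def alphabeta(state, is_terminal, utility, actions, result, to_move,
              alpha=float("-inf"), beta=float("inf")):
    """Return the minimax value of `state`, pruning branches outside (alpha, beta)."""
    if is_terminal(state):
        return utility(state)
    if to_move(state) == "MAX":
        value = float("-inf")
        for a in actions(state):
            value = max(value, alphabeta(result(state, a), is_terminal, utility,
                                         actions, result, to_move, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:            # Min would never allow this branch
                break
        return value
    value = float("inf")
    for a in actions(state):
        value = min(value, alphabeta(result(state, a), is_terminal, utility,
                                     actions, result, to_move, alpha, beta))
        beta = min(beta, value)
        if beta <= alpha:                # Max would never allow this branch
            break
    return value

# Toy two-ply game: Max picks a branch, Min then picks one of its three leaves.
leaves = {"L": [3, 12, 8], "M": [2, 4, 6], "R": [14, 5, 2]}
value = alphabeta(
    ("MAX", None),
    is_terminal=lambda s: isinstance(s, int),
    utility=lambda s: s,
    actions=lambda s: list(leaves) if s[0] == "MAX" else range(3),
    result=lambda s, a: ("MIN", a) if s[0] == "MAX" else leaves[s[1]][a],
    to_move=lambda s: s[0])
print(value)  # -> 3
```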
Solving Non-Deterministic Games
Previously: Max and Min alternate turns
Now: chance nodes sit between the Max and Min turns
😣
Constraint Satisfaction
- Express problem in terms of state variables
- Constrain state variables
- Begin with all variables unassigned
- Progressively assign values to variables
- An assignment of values to the state variables that “works” (satisfies every constraint) is a solution
Constraint Types
- Unary: restrict single variable
- Can be rolled into domain
- Why even have them?
- Binary: restricts two variables
- Global: restrict “all” variables
Constraint Examples
- \(X_1\) and \(X_2\) both have real domains, i.e. \(X_1, X_2 \in \mathbb{R}\)
- A constraint could be \(X_1 < X_2\)
- \(X_1\) could have domain \(\{\text{red}, \text{green}, \text{blue}\}\) and \(X_2\) could have domain \(\{\text{green}, \text{blue}, \text{orange}\}\)
- A constraint could be \(X_1 \neq X_2\)
- \(X_1, X_2, ..., X_{100} \in \mathbb{R}\)
- Constraint: exactly four of \(X_i\) equal 12
- Rewrite as binary constraint?
Assignments
- Assignments must be to values in each variable’s domain
- Assignment violates constraints?
- All variables assigned?
Graph Representation I
Constraint graph: edges are constraints
Graph Representation II
Constraint hypergraph: constraints are nodes
Solving CSPs
- We can search!
- …the space of consistent assignments
- Complexity \(O(d^n)\)
- Domain size \(d\), number of nodes \(n\)
- Tree search for node assignment
- Inference to reduce domain size
- Recursive search
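A minimal recursive backtracking sketch over explicit domains and constraints (a tiny map-colouring-style problem, invented for illustration): assign one variable at a time, and undo any assignment that violates a constraint or leads to a dead end.

```python
def backtrack(assignment, domains, constraints):
    """Depth-first search over assignments; constraints are predicates over a partial assignment."""
    if len(assignment) == len(domains):
        return assignment                                      # all variables assigned: solution
    var = next(v for v in domains if v not in assignment)      # pick an unassigned variable
    for value in domains[var]:
        assignment[var] = value
        if all(ok(assignment) for ok in constraints):
            result = backtrack(assignment, domains, constraints)
            if result is not None:
                return result
        del assignment[var]                                    # violated or dead end: backtrack
    return None

# Toy problem: three variables, adjacent ones must take different values.
domains = {"A": ["red", "green"], "B": ["red", "green"], "C": ["red", "green", "blue"]}
def differ(x, y):
    return lambda a: x not in a or y not in a or a[x] != a[y]  # ignore still-unassigned variables
constraints = [differ("A", "B"), differ("B", "C"), differ("A", "C")]
print(backtrack({}, domains, constraints))  # -> {'A': 'red', 'B': 'green', 'C': 'blue'}
```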
Inference
- Arc consistency
- Reduce domains for pairs of variables
- Path consistency
- Assignment to two variables
- Reduce domain of third variable
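Arc consistency can be enforced with an AC-3-style loop: delete any value of \(X_i\) that has no supporting value in \(X_j\), and whenever a domain shrinks, re-check the arcs pointing into it. A sketch, assuming one predicate per directed arc (variables and domains are illustrative):

```python
from collections import deque

def ac3(domains, constraints):
    """Prune domains until every arc (Xi, Xj) is consistent.

    `constraints[(xi, xj)]` is a predicate over a pair of values; `domains` is
    mutated in place. Returns False if some domain becomes empty.
    """
    queue = deque(constraints)                            # start with every arc
    while queue:
        xi, xj = queue.popleft()
        ok = constraints[(xi, xj)]
        supported = [v for v in domains[xi] if any(ok(v, w) for w in domains[xj])]
        if len(supported) < len(domains[xi]):             # some value of Xi lost its support
            domains[xi] = supported
            if not supported:
                return False
            queue.extend((xk, xi) for (xk, xl) in constraints if xl == xi and xk != xj)
    return True

# Toy example: X < Y over small integer domains, with arcs in both directions.
domains = {"X": [1, 2, 3], "Y": [1, 2, 3]}
constraints = {("X", "Y"): lambda x, y: x < y,
               ("Y", "X"): lambda y, x: y > x}
ac3(domains, constraints)
print(domains)  # -> {'X': [1, 2], 'Y': [2, 3]}
```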
Ordering
- Select-Unassigned-Variable(\(CSP, assignment\))
- Choose most-constrained variable
- Order-Domain-Values(\(CSP, var, assignment\))
- Tree-structure: Linear time solution
Logic
- \(\neg\)
- “Not” operator, same as CS (`!`, `not`, etc.)
- \(\land\)
- “And” operator, same as CS (`&&`, `and`, etc.)
- This is sometimes called a conjunction.
- \(\lor\)
- “Inclusive Or” operator, same as CS.
- This is sometimes called a disjunction.
Unfamiliar Logical Operators
- \(\Rightarrow\)
- Logical implication.
- If \(X_0 \Rightarrow X_1\), \(X_1\) is always True when \(X_0\) is True.
- If \(X_0\) is False, the value of \(X_1\) is not constrained.
- \(\iff\)
- “If and only If.”
- If \(X_0 \iff X_1\), \(X_0\) and \(X_1\) are either both True or both False.
- Also called a biconditional.
Knowledge Base & Queries
- We encode everything that we ‘know’
- We query the knowledge base
- Statement that we’d like to know about
- Logic:
- Is statement consistent with KB?
Entailment
- \(KB \models A\)
- “Knowledge Base entails A”
- For every model in which \(KB\) is True, \(A\) is also True
- One-way relationship: \(A\) can be True for models where \(KB\) is not True.
- Vocabulary: \(A\) is the query
Knowing Things
Falsehood:
- \(KB \models \neg A\)
- No model exists where \(KB\) is True and \(A\) is True
It is possible to not know things:
- \(KB \nvdash A\)
- \(KB \nvdash \neg A\)
Converting to CNF
- \(X_0 \iff X_1\)
- \((X_0 \Rightarrow X_1) \land (X_1 \Rightarrow X_0)\)
- \(X_0 \Rightarrow X_1\)
- \(\neg X_0 \lor X_1\)
- \(\neg (X_0 \land X_1)\)
- \(\neg X_0 \lor \neg X_1\)
- \(\neg (X_0 \lor X_1)\)
- \(\neg X_0 \land \neg X_1\)
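Applying these rules in sequence to a small (arbitrary) sentence, with one extra step at the end that distributes \(\lor\) over \(\land\):
\[
\begin{aligned}
X_0 \iff (X_1 \lor X_2)
&\equiv (X_0 \Rightarrow (X_1 \lor X_2)) \land ((X_1 \lor X_2) \Rightarrow X_0) \\
&\equiv (\neg X_0 \lor X_1 \lor X_2) \land (\neg(X_1 \lor X_2) \lor X_0) \\
&\equiv (\neg X_0 \lor X_1 \lor X_2) \land ((\neg X_1 \land \neg X_2) \lor X_0) \\
&\equiv (\neg X_0 \lor X_1 \lor X_2) \land (\neg X_1 \lor X_0) \land (\neg X_2 \lor X_0)
\end{aligned}
\]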
Joint Distributions
- Distribution over multiple variables
- \(P(x, y)\) represents \(P\{X=x, Y=y\}\)
- Marginal distribution: \(P(x) = \sum_y P(x, y)\)
Independence
Conditional probability:
\[P(x | y) = \frac{P(x, y)}{P(y)}\]
Bayes’ rule:
\[P(x | y) = \frac{P(y | x)P(x)}{P(y)} \]
Conditional Independence
\[P(x | y) = P(x) \Rightarrow P(x,y) = P(x) P(y)\]
- Two variables can be conditionally independent…
- … when conditioned on a third variable
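For concreteness, conditional independence of \(X\) and \(Y\) given \(Z\) means
\[P(x, y \mid z) = P(x \mid z)\, P(y \mid z),\]
which can hold even when \(X\) and \(Y\) are not independent on their own (and vice versa).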
Markov Chains
Markov property:
\(P(X_{t} | X_{t-1},X_{t-2},...,X_{0}) = P(X_{t} | X_{t-1})\)
“The future only depends on the past through the present.”
- State \(X_{t-1}\) captures “all” information about past
- No information in \(X_{t-2}\) (or other past states) influences \(X_{t}\)
State Transitions
Stochastic matrix \(P\)
\[
P = \begin{bmatrix}
P_{1,1} & \dots & P_{1,n}\\
\vdots & \ddots & \\
P_{n, 1} & & P_{n,n}
\end{bmatrix}
\]
- All rows sum to 1
- Discrete state spaces implied
Stationary Behavior
- “Long run” behavior of Markov chain
\(x_0 P^k\) for large \(k\)
- “Stationary state” \(\pi\) such that:
\(\pi = \pi P\)
- Row eigenvector for \(P\) for eigenvalue 1
- 😌
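A numeric sketch of both views: iterate \(x_{k+1} = x_k P\) for a while, then compare with the row eigenvector of \(P\) for eigenvalue 1 (the transition matrix is made up for illustration).

```python
import numpy as np

# Illustrative 2-state stochastic matrix; each row sums to 1.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# "Long run" behaviour: x0 P^k for large k.
x = np.array([1.0, 0.0])
for _ in range(100):
    x = x @ P
print(x)                       # approaches the stationary distribution (~[0.833, 0.167])

# Stationary state pi with pi = pi P: a row (left) eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
print(pi / pi.sum())           # normalised: same answer
```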
Absorbing States
- State that cannot be “escaped” from
- Example: gambling \(\rightarrow\) running out of money
\(P = \begin{bmatrix} 0.5 & 0.3 & 0.1 & 0.1 \\ 0.3 & 0.4 & 0.3 & 0 \\ 0.1 & 0.6 & 0.2 & 0.1 \\ 0 & 0 & 0 & 1 \end{bmatrix}\)
- Non-absorbing states: “transient” states
Markov Reward Process
- Reward function \(R_s = E[R_{t+1} | S_t = s]\):
- Reward for being in state \(s\)
- Discount factor \(\gamma \in [0, 1]\)
\(U_t = \sum_k \gamma^k R_{t+k+1}\)
The Markov Decision Process
- Transition probabilities depend on actions
Markov Process:
\(s_{t+1} = s_t P\)
Markov Decision Process (MDP):
\(s_{t+1} = s_t P^a\)
Rewards: \(R^a\) with discount factor \(\gamma\)
MDP - Policies
- Agent function
- Actions conditioned on states
\(\pi(s) = P[A_t = a | S_t = s]\)
- Can be stochastic
- Usually deterministic
- Usually stationary
MDP - Policies
State value function \(U^\pi\):
\(U^\pi(s) = E_\pi[U_t | S_t = s]\)
State-action value function \(Q^\pi\):
\(Q^\pi(s,a) = E_\pi[U_t | S_t = s, A_t = a]\)
Notation: \(E_\pi\) indicates expected value under policy \(\pi\)
Bellman Expectation
Value function:
\(U^\pi(s) = E_\pi[R_{t+1} + \gamma U^\pi (S_{t+1}) | S_t = s]\)
Action-value function:
\(Q^\pi(s, a) = E_\pi[R_{t+1} + \gamma Q^\pi(S_{t+1}, A_{t+1}) | S_t = s, A_t =a]\)
Bellman Equation
\[U^*(s) = \max_a \left[ R(s, a) + \gamma \sum \limits_{s'} T(s' | s, a) U^*(s') \right]\]
Bellman Equation
\[Q^*(s, a) = R(s, a) + \gamma \sum \limits_{s'} T(s' | s, a) \max \limits_{a'} Q^*(s', a')\]
How To Solve It
- No closed-form solution
- Optimal case differs from policy evaluation
Iterative Solutions:
- Value Iteration
- Policy Iteration
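A small value-iteration sketch, repeatedly applying the Bellman optimality backup \(U(s) \gets \max_a [R(s,a) + \gamma \sum_{s'} T(s'\mid s,a)\,U(s')]\) to a made-up two-state MDP.

```python
# Illustrative 2-state, 2-action MDP: T[s][a] maps next states to probabilities.
T = {0: {"stay": {0: 1.0}, "go": {0: 0.2, 1: 0.8}},
     1: {"stay": {1: 1.0}, "go": {0: 0.9, 1: 0.1}}}
R = {0: {"stay": 0.0, "go": -1.0},
     1: {"stay": 2.0, "go": 0.0}}
gamma = 0.9

U = {s: 0.0 for s in T}
for _ in range(200):                        # iterate the backup until the values stop changing
    U = {s: max(R[s][a] + gamma * sum(p * U[s2] for s2, p in T[s][a].items())
                for a in T[s])
         for s in T}

# Greedy policy extracted from the converged values.
policy = {s: max(T[s], key=lambda a: R[s][a] + gamma * sum(p * U[s2] for s2, p in T[s][a].items()))
          for s in T}
print(U, policy)                            # roughly U = {0: 16.3, 1: 20.0}, policy = {0: 'go', 1: 'stay'}
```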
Reinforcement Learning:
Model Uncertainty
Action-value function:
\(Q(s, a) = R(s, a) + \gamma \sum \limits_{s'} T(s' | s, a) U(s')\)
But we don’t know \(T\), so work directly with expected returns:
\(U^\pi(s) = E_\pi \left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + ...|s \right]\)
\(Q(s, a) = E_\pi \left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + ...|s,a \right]\)
Temporal Difference (TD) Learning
- Take action from state, observe new state, reward
\(U(s) \gets U(s) + \alpha \left[ r + \gamma U(s') - U(s)\right]\)
- Update immediately given \((s, a, r, s')\)
- TD Error: \(\left[ r + \gamma U(s') - U(s)\right]\)
- Measurement: \(r + \gamma U(s')\)
- Old Estimate: \(U(s)\)
RL Methods
- Q-Learning
- Sarsa
- Eligibility traces
RL Methods
Q-Learning:
\(\quad \quad Q(s,a) \gets Q(s,a) +\alpha \left[R + \gamma \max \limits_{a'} Q(s', a') - Q(s,a)\right]\)
Sarsa:
\(\quad \quad Q(s,a) \gets Q(s,a) +\alpha \left[R + \gamma Q(s', a') - Q(s,a)\right]\)
Sarsa-\(\lambda\):
\(\quad \quad \delta = R + \gamma Q(s',a') - Q(s,a)\)
\(\quad \quad Q(s,a) \gets Q(s,a) + \alpha \delta N(s,a)\)
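A tabular sketch of the Q-learning update from this slide, run on a tiny made-up chain environment (environment, exploration scheme, and hyperparameters are all illustrative).

```python
import random

random.seed(0)

def step(s, a):
    """Illustrative 4-state chain: reaching state 3 gives reward 1 and ends the episode."""
    s2 = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0), s2 == 3          # next state, reward, done

ACTIONS = ("left", "right")
Q = {(s, a): 0.0 for s in range(4) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for _ in range(300):                                        # episodes
    s = 0
    for _ in range(30):                                     # cap episode length
        if random.random() < epsilon:                       # epsilon-greedy exploration
            a = random.choice(ACTIONS)
        else:                                               # greedy with random tie-breaking
            a = max(ACTIONS, key=lambda b: (Q[(s, b)], random.random()))
        s2, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # the Q-learning update
        s = s2
        if done:
            break

print(max(ACTIONS, key=lambda a: Q[(0, a)]))                # greedy action learned at state 0
```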
Monte Carlo Tree Search - Search
Monte Carlo Tree Search - Expansion
- State \(\notin T\)
- Initialize \(N(s,a)\) and \(Q(s,a)\)
- Add state to \(T\)
Monte Carlo Tree Search - Rollout
- Policy \(\pi_0\) is “rollout” policy
- Usually stochastic
- States not tracked
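A compact sketch putting the three pieces together in the slides’ \(N(s,a)\)/\(Q(s,a)\)/\(T\) notation: UCB-style selection inside the tree, expansion of unseen states, a random rollout below the tree, and a running-average backup. The single-agent chain environment and the constants are made up.

```python
import math
import random

random.seed(0)

# Illustrative single-agent problem: a 4-state chain where reaching state 3 yields reward 1.
ACTIONS = ("left", "right")
def step(s, a):
    s2 = min(s + 1, 3) if a == "right" else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

gamma, c, depth = 0.95, 1.0, 8
N, Q, T = {}, {}, set()                     # visit counts, value estimates, expanded states

def rollout(s, d):
    """Rollout: follow the stochastic rollout policy pi_0; states are not added to the tree."""
    if d == 0:
        return 0.0
    s2, r = step(s, random.choice(ACTIONS))
    return r + gamma * rollout(s2, d - 1)

def simulate(s, d):
    if d == 0:
        return 0.0
    if s not in T:                          # expansion: initialise N(s,a), Q(s,a), add s to T
        for a in ACTIONS:
            N[(s, a)], Q[(s, a)] = 0, 0.0
        T.add(s)
        return rollout(s, d)
    total = sum(N[(s, b)] for b in ACTIONS) + 1
    a = max(ACTIONS,                        # UCB-style selection of the next action
            key=lambda b: Q[(s, b)] + c * math.sqrt(math.log(total) / (N[(s, b)] + 1)))
    s2, r = step(s, a)
    q = r + gamma * simulate(s2, d - 1)
    N[(s, a)] += 1
    Q[(s, a)] += (q - Q[(s, a)]) / N[(s, a)]   # back up the return as a running average
    return q

for _ in range(300):                        # the search: repeated simulations from the root
    simulate(0, depth)
print(max(ACTIONS, key=lambda a: Q[(0, a)]))   # best root action after search
```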
References
Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. 4th Edition, 2020.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. 2nd Edition, 2018.
Mykel Kochenderfer, Tim Wheeler, and Kyle Wray. Algorithms for Decision Making. 1st Edition, 2022.
UC Berkeley CS188
Stanford CS234 (Emma Brunskill)
Stanford CS228 (Mykel Kochenderfer)