Search & Planning

Introduction to AI (I2AI)

Andy Weeger

Neu-Ulm University of Applied Sciences

March 12, 2025

Introduction

Planning agents

Rational agents select actions to maximize performance across time.

Planning involves mapping action sequences from current to goal states.

Solutions for search problems are efficient action sequences reaching goal states.

Optimization differs from planning: it focuses on finding optimal states, not paths.

Examples

Terms

Agents that plan by considering a sequence of actions that form a path to a goal are called planning agents (Russel and Norvig 2022, 81).

  • The computational process they undertake is search.
  • The representations the agents use are atomic representations1.
  • The search algorithms can be uninformed or informed.

Here only simple environments are considered (episodic, single agent, fully observable, deterministic, static, discrete, and known).

We assume that information about the environment is given (e.g., a map).

Problem-solving process

In simple environments, agents can follow a four-phase problem-solving process (Russel and Norvig 2022, 81–82):

  1. Goal formulation: Identifying the objectives to be achieved based on the current situation and agent’s purpose
  2. Problem formulation: Deciding what states and actions to consider to achieve the goal2
  3. Search: Systematically exploring possible sequences of actions to find a solution path from initial state to goal state
  4. Execution: Carrying out the solution actions in the environment

Search problem

Definition

A search problem can be defined formally as (Russel and Norvig 2022, 83):

  • State space: a set of possible states3 the environment can be in4
  • Initial state: the state that the agent starts in
  • Goal state(s): a state that the agent is trying to reach5
  • Transition model: \(\textrm{RESULT}(s,a)\) — returns the state \(s'\) that results from performing action \(a\) in state \(s\)
  • Successor function: \(\textrm{ACTIONS}(s)\) — returns the set of (action, state) pairs for state \(s\), where each state is the one reachable from \(s\) by taking the corresponding action
  • Action cost function: \(\textrm{ACTION-COST}(s,a,s')\) — gives the numeric cost of performing action \(a\) in state \(s\) to reach state \(s'\)
  • Path: a sequence of actions
  • Solution: a path from the initial state to the goal state
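
As a minimal illustration, the components of this definition can be expressed in code. The following Python sketch models a route-finding problem in the spirit of the Romania example; the class and method names (`RouteProblem`, `actions`, `result`, `action_cost`, `is_goal`) and the small road excerpt are illustrative assumptions, not a prescribed API.

```python
# A minimal sketch of a formal search problem (route finding on an excerpt
# of the Romania map). All names are illustrative, not a fixed API.

ROADS = {  # a few connections with their distances, as in the standard example
    ("Arad", "Sibiu"): 140,
    ("Arad", "Timisoara"): 118,
    ("Arad", "Zerind"): 75,
    ("Sibiu", "Fagaras"): 99,
    ("Sibiu", "Rimnicu Vilcea"): 80,
}

class RouteProblem:
    def __init__(self, initial, goal):
        self.initial = initial   # initial state: the city the agent starts in
        self.goal = goal         # goal state: the city the agent wants to reach

    def actions(self, state):
        """Actions applicable in `state`: drive to a directly connected city."""
        return ([b for (a, b) in ROADS if a == state] +
                [a for (a, b) in ROADS if b == state])

    def result(self, state, action):
        """Transition model RESULT(s, a): driving to a city puts the agent there."""
        return action

    def action_cost(self, state, action, result):
        """ACTION-COST(s, a, s'): the road distance between the two cities."""
        return ROADS.get((state, result)) or ROADS.get((result, state))

    def is_goal(self, state):
        """Goal test: does `state` satisfy the goal condition?"""
        return state == self.goal

problem = RouteProblem("Arad", "Bucharest")
```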

Problem formulation (modelling)

Figure 1: Example: A simplified map of Romania, with road distances in miles; based on Russel and Norvig (2022, 84)

Search trees

Definition

The formulation of a search problem, encompassing the definition of state space, initial state, goal states, transition model, and other pertinent elements, effectively constitutes the blueprint for a search tree:

  • The root node of the tree corresponds to your initial state
  • The branches represent actions defined by your transition model
  • The child nodes are states resulting from taking those actions
  • Leaf nodes are states with no successors (or states we choose not to expand)
  • Goal nodes are states that satisfy your goal condition

The search tree is the computational manifestation of your problem formulation.
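
To make this correspondence concrete, a search-tree node is typically stored together with its state, its parent node, the action that generated it, and the path cost accumulated so far. A minimal sketch; the field and function names follow common textbook conventions and are assumptions, not part of the formal definition above.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    state: Any                        # the state this node corresponds to
    parent: Optional["Node"] = None   # the node this one was generated from
    action: Any = None                # the action applied to the parent's state
    path_cost: float = 0.0            # cost of the path from the root to this node

def expand(problem, node):
    """Generate the child nodes of `node` via the problem's transition model."""
    for action in problem.actions(node.state):
        next_state = problem.result(node.state, action)
        cost = node.path_cost + problem.action_cost(node.state, action, next_state)
        yield Node(state=next_state, parent=node, action=action, path_cost=cost)
```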

Frontier

The frontier (sometimes called the “open list” or “fringe”) is the collection of nodes that have been generated but not yet expanded or evaluated.

The frontier serves as the boundary between explored and unexplored parts of the search space. It contains all the nodes that are immediate candidates for expansion.

You might also call the frontier the “to-do list” of nodes that the search algorithm maintains.
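
Depending on the algorithm, this "to-do list" is realised as a FIFO queue, a LIFO stack, or a priority queue ordered by an evaluation function \(f(n)\). A sketch of the priority-queue variant; the class name and interface are assumptions made for illustration.

```python
import heapq

class PriorityFrontier:
    """Frontier that always pops the node with the smallest f(n)-value."""

    def __init__(self, f):
        self.f = f          # evaluation function, e.g. path cost
        self.heap = []
        self.counter = 0    # tie-breaker so heapq never has to compare nodes

    def add(self, node):
        heapq.heappush(self.heap, (self.f(node), self.counter, node))
        self.counter += 1

    def pop(self):
        return heapq.heappop(self.heap)[-1]

    def __bool__(self):
        return bool(self.heap)   # truthy while there are candidates to expand
```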

Example

Figure 2: Example: A partial search tree for finding a route from Arad to Bucharest; based on Russel and Norvig (2022)

Search algorithms

Definition

Search algorithms take a search problem as input and return a solution, or an indication of failure (Russel and Norvig 2022, 89).

  • They superimpose a search tree over the state-space graph,
  • form various paths from the initial state, and
  • try to find a path that reaches a goal state.

Depending on the problem formulation, they can implement different methods:

  • Uninformed search, which only has access to the problem definition but no clue about how close a state is to the goal(s).
  • Informed search, which has access to a heuristic function that gives domain-specific hints about how close a state is to the goal(s) (e.g., using straight-line distance in route-finding problems)
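
Many of these methods can be written as one generic best-first search that differs only in the evaluation function \(f(n)\) handed to the frontier, e.g., the path cost for uniform-cost (uninformed) search, or path cost plus a heuristic estimate for informed search. A hedged sketch, reusing the hypothetical `Node`, `expand`, and `PriorityFrontier` helpers sketched above:

```python
def best_first_search(problem, f):
    """Generic best-first search; returns a goal node or None on failure."""
    root = Node(problem.initial)
    frontier = PriorityFrontier(f)
    frontier.add(root)
    reached = {problem.initial: root}        # best node found so far per state

    while frontier:
        node = frontier.pop()
        if problem.is_goal(node.state):
            return node                       # the path can be read off via .parent
        for child in expand(problem, node):
            s = child.state
            if s not in reached or child.path_cost < reached[s].path_cost:
                reached[s] = child
                frontier.add(child)
    return None                               # indication of failure

# Uninformed variant (uniform-cost search): order the frontier by path cost only.
# goal_node = best_first_search(problem, f=lambda n: n.path_cost)
```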

Performance metrics

Performance metrics guide the choice of algorithms based on problem requirements, help predict computational demands, and establish clear expectations about what an algorithm can and cannot do.

  • Completeness refers to an algorithm’s guarantee to find a solution whenever one exists.
  • Optimality refers to the algorithm’s capability to find the optimal solution.
  • Complexity determines time and memory/space requirements.
  • Global solutions (path) vs. local solutions (optimal or satisfactory state only).

2-Player games

Example: AlphaGo

MinMax value

Given a state-space search tree, each node’s MinMax value is calculated.

The MinMax value is the best achievable utility of being in a given state (against a rational adversary).

  • MAX prefers to move to a state of maximum value
  • MIN prefers to move to a state of minimum value
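
Written out, the MinMax value of a state \(s\) can be defined recursively, following the formulation in Russel and Norvig (2022) and using the \(\textrm{ACTIONS}\) and \(\textrm{RESULT}\) notation introduced above:

\[
\textrm{MINIMAX}(s) =
\begin{cases}
\textrm{UTILITY}(s) & \text{if } s \text{ is terminal} \\
\max_{a \in \textrm{ACTIONS}(s)} \textrm{MINIMAX}(\textrm{RESULT}(s, a)) & \text{if it is MAX's turn in } s \\
\min_{a \in \textrm{ACTIONS}(s)} \textrm{MINIMAX}(\textrm{RESULT}(s, a)) & \text{if it is MIN's turn in } s
\end{cases}
\]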

Adversarial game tree

Figure 5: Example: An adversarial game tree

MinMax algorithm

The MinMax algorithm performs a complete depth-first exploration of the game tree (Russel and Norvig 2022, 196–96).

  • Assumes that the adversary plays optimally.
  • Returns action whose terminal state has the optimal MinMax value.
  • The tree is calculated down to a maximum depth \(L\) (i.e., the planning horizon).
  • If level \(L-1\) belongs to MIN, MinMax chooses in each state the action that leads to a successor of minimum utility, since minimum utility for MAX is maximum utility for MIN. Therefore, each node in level \(L-1\) is assigned the minimum utility of its successor nodes.
  • In the same way, each node in level \(L-2\) is assigned the maximum utility of its successor nodes in level \(L-1\).
  • This process is repeated up to the root node. MAX then knows its next action: the one that leads to a successor with maximum utility. (A code sketch follows below.)
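
A compact, depth-limited sketch of this procedure in Python. The `game` interface (`is_terminal`, `evaluate`, `actions`, `result`) is an assumption made for illustration, mirroring the notation used in this section; it is not code from Russel and Norvig (2022).

```python
def minmax_value(game, state, depth, max_turn):
    """Depth-limited MinMax value of `state`; `depth` is the remaining horizon."""
    if game.is_terminal(state) or depth == 0:
        return game.evaluate(state)          # domain-specific evaluation function
    values = [minmax_value(game, game.result(state, a), depth - 1, not max_turn)
              for a in game.actions(state)]
    return max(values) if max_turn else min(values)

def minmax_decision(game, state, depth):
    """MAX's move: the action leading to the successor with maximum MinMax value."""
    return max(game.actions(state),
               key=lambda a: minmax_value(game, game.result(state, a),
                                          depth - 1, max_turn=False))
```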

Alpha-Beta Pruning

Alpha-Beta Pruning is an improvement of the MinMax algorithm in that it reduces the number of nodes that have to be generated and evaluated, by applying a simple rule (Russel and Norvig 2022, 198–99). The process is as follows:

  • Create the tree with a limited depth, searching from left to right
  • As in MinMax search, evaluate the nodes at the deepest level by applying a domain-specific evaluation function
  • For each MAX node, store the maximum value of the underlying subtree traversed so far in the variable α
  • For each MIN node, store the minimum value of the underlying subtree traversed so far in the variable β
  • If at a MIN node \(k\) the current value β is smaller than α, the search under \(k\) can be terminated. Here α is the largest value of a MAX node on the path from the root to \(k\).
  • If at a MAX node \(k\) the current value α is larger than β, the search under \(k\) can be terminated. Here β is the smallest value of a MIN node on the path from the root to \(k\). (A code sketch follows below.)
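
The same recursion with the bookkeeping variables α and β added; again a sketch only, assuming the same hypothetical `game` interface as in the MinMax sketch above.

```python
def alphabeta(game, state, depth, alpha, beta, max_turn):
    """Depth-limited MinMax value of `state`, computed with alpha-beta pruning."""
    if game.is_terminal(state) or depth == 0:
        return game.evaluate(state)
    if max_turn:
        value = float("-inf")
        for a in game.actions(state):
            value = max(value, alphabeta(game, game.result(state, a),
                                         depth - 1, alpha, beta, False))
            alpha = max(alpha, value)        # best value for MAX on this path so far
            if alpha >= beta:                # MIN above would never allow this branch
                break                        # prune the remaining successors
        return value
    else:
        value = float("inf")
        for a in game.actions(state):
            value = min(value, alphabeta(game, game.result(state, a),
                                         depth - 1, alpha, beta, True))
            beta = min(beta, value)          # best value for MIN on this path so far
            if beta <= alpha:                # MAX above already has a better option
                break                        # prune the remaining successors
        return value

# Initial call: alphabeta(game, state, depth=L, alpha=float("-inf"),
#                         beta=float("inf"), max_turn=True)
```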

Introduction

The concept of a limited planning horizon, as applied in the MinMax algorithm, is one way to handle complex problems that cannot be planned all the way to the goal state in one step.

Another concept is to construct a tree down to a final state in such a way that in each state only promising actions and their corresponding successors are generated. This concept is applied by Monte Carlo tree search.

Functioning

Monte Carlo tree search (MCTS)9 estimates the value of a state as the average utility10 over a number of simulations of complete games starting from the current state (Russel and Norvig 2022, 207–9).

Each iteration follows four steps:

  • Selection: starting from the root node, apply the tree policy to select the next action (e.g., based on average utility11).
  • Expansion: at a leaf node, select an arbitrary action and generate the corresponding new child node.
  • Simulation: simulate the game starting from the new child node. At the end of each simulation, the game is won or lost.
  • Backpropagation: the result of the simulation is used to update all the search tree nodes going up to the root.

When the computational budget is reached, MCTS returns the best action for the given root node.
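
The four steps map directly onto a loop over the computational budget. The following is a deliberately simplified sketch: the `game` interface is hypothetical (as above), the playout policy is plain random choice, and utilities are credited from a single perspective; a full two-player implementation would credit each node from the viewpoint of the player who moved into it.

```python
import math
import random

class MCTSNode:
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.utility = 0.0    # total utility U(n) of all playouts through this node
        self.playouts = 0     # number of playouts N(n) through this node
        self.untried = None   # actions not yet expanded into children

def ucb1(node, C=math.sqrt(2)):
    """Tree policy: upper confidence bound (see the UCB1 formula below)."""
    return (node.utility / node.playouts +
            C * math.sqrt(math.log(node.parent.playouts) / node.playouts))

def mcts(game, root_state, budget):
    root = MCTSNode(root_state)
    root.untried = list(game.actions(root_state))
    for _ in range(budget):
        # 1. Selection: descend through fully expanded nodes using the tree policy.
        node = root
        while not node.untried and node.children:
            node = max(node.children, key=ucb1)
        # 2. Expansion: generate one new child for a not-yet-tried action.
        if node.untried:
            a = node.untried.pop()
            child = MCTSNode(game.result(node.state, a), parent=node, action=a)
            child.untried = list(game.actions(child.state))
            node.children.append(child)
            node = child
        # 3. Simulation: play out to a terminal state using the playout policy.
        state = node.state
        while not game.is_terminal(state):
            state = game.result(state, random.choice(game.actions(state)))
        utility = game.utility(state)   # e.g. 1 for a win, 0 for a loss
        # 4. Backpropagation: update statistics on the path back to the root.
        while node is not None:
            node.playouts += 1
            node.utility += utility
            node = node.parent
    # Return the action of the most played-out child of the root (a common choice).
    return max(root.children, key=lambda c: c.playouts).action
```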

Example: selection

Figure 6: Example: MCTS — selection

Figure 6 shows a tree whose root represents a state in which P-A has won 37 of the 100 playouts performed so far.

  • P-A has just moved to the root node
  • P-B selects a move to a node where it has won 60/79 playouts; this is the best win percentage among the available moves
  • P-A will select a move to a node where it has won 16/53 playouts (assuming it plays optimally)
  • P-B then continues on the leaf node marked 27/35
  • … until a terminal state is reached

Example: selection policy (UCT)

Upper confidence bounds applied to trees (UCT) is a very effective selection policy that ranks possible moves based on an upper confidence bound formula (UCB1):

\[ \textrm{UCB1}(n) = \frac{\textrm{U}(n)}{\textrm{N}(n)} + C \cdot \sqrt{\frac{\log{\textrm{N}(\textrm{PARENT}(n))}}{\textrm{N}(n)}} \]

  • \(\textrm{U}(n)\) is the total utility of all playouts that went through node \(n\)
  • \(\textrm{N}(n)\) is the number of playouts through node \(n\)
  • \(\textrm{PARENT}(n)\) is the parent node of \(n\) in the tree
  • \(\frac{\textrm{U}(n)}{\textrm{N}(n)}\) is the average utility of \(n\) (exploitation term, “how good are the stats?”)
  • \(\frac{\log{\textrm{N}(\textrm{PARENT}(n))}}{\textrm{N}(n)}\) is higher for nodes \(n\) that have been explored only a few times
    (exploration term, “how much has the child been ‘ignored’?”)
  • \(C\) is a constant that balances exploitation and exploration (theoretically \(\sqrt{2}\))
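
As a small worked example, plugging the node marked 60/79 from Figure 6 into the formula (its parent, the root, has 100 playouts; using the natural logarithm and \(C \approx 1.4\)):

\[
\textrm{UCB1}(n) = \frac{60}{79} + 1.4 \cdot \sqrt{\frac{\log 100}{79}} \approx 0.76 + 1.4 \cdot 0.24 \approx 1.10
\]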

Example: expansion and simulation

Figure 7: Example: MCTS — expansion and simulation

Figure 7 shows a tree where a new child of the selected node is generated and marked with 0/0 (expansion).

A playout for the newly generated child node is performed (simulation).

  • Moves for both players are chosen according to the playout policy12
  • The moves are not recorded in the search tree
  • In Figure 7, the simulation results in a win for P-B

Example: back-propagation

Figure 8: Example: MCTS — back-propagation

The result of the simulation is used to update all the search tree nodes going up to the root.

  • P-B's nodes are incremented in both the number of wins and the number of playouts
  • P-A's nodes are incremented in the number of playouts only

Wrap-up

Adversarial search

  • In two-player, discrete, deterministic, turn-taking zero-sum games with perfect information, the minimax algorithm can select optimal moves by a depth-first search in the game tree
  • Efficiency can be improved by using the alpha-beta search algorithm, which eliminates subtrees that are shown to be irrelevant.
  • Monte Carlo tree search evaluates states by playing out the game all the way to the end to see who won. This playout simulation is repeated multiple times. The evaluation is an average of the results.

xkcd

Figure 9: xkcd 1263: reassuring

Q&A

Exercises

Definition sequence

Problem formulation must follow goal formulation.

Discuss that statement. Is it true or false? Why?

Problem formulation

Give a complete problem formulation for each of the following problems. Choose a formulation that is precise enough to be implemented.

  • There are six glass boxes in a row, each with a lock. Each of the first five boxes holds a key unlocking the next box in line; the last box holds a banana. You have the key to the first box, and you want the banana.
  • There is an \(n \times n\) grid of squares, each square initially being either unpainted floor or a bottomless pit. You start standing on an unpainted floor square, and can either paint the square under you or move into an adjacent unpainted floor square. You want the whole floor painted.

Maze

Your goal is to navigate a robot out of a maze. It starts in the center of the maze facing north. You can turn the robot to face north, east, south, or west, and direct the robot to move forward a certain distance (it will stop before hitting a wall).

  1. Formally define this problem as a search problem. How large is the state space?
  2. In navigating a maze, the only place we need to turn is at the intersection of two or more corridors. Reformulate this problem using this observation. How large is the state space now?
  3. From each point in the maze, we can move in any of the four directions until we reach a turning point, and this is the only action we need to do. Reformulate the problem using these actions. Do we need to keep track of the robot’s orientation now?
  4. In our initial description of the problem we already abstracted from the real world. Name three such simplifications we made.

Concepts

Explain in your own words the following terms:

  • Zero-sum
  • Terminal test
  • Minimax value
  • Selection policy
  • Playout policy
  • Monte Carlo tree
  • Back-propagation

MINIMAX

Explain if the MINIMAX algorithm is complete and optimal.

Can it be beaten by an opponent playing suboptimally? Why (not)?

Come up with a game tree in which MAX will beat a suboptimal MIN.

Pruning

Read the note about pruning (and consult Russel and Norvig (2022) if necessary).

Explain in your own words, under what conditions a subtree is skipped using Alpha-beta pruning.

Draw an example (game search tree, 3 levels depth).

Literature

Russel, Stuart, and Peter Norvig. 2022. Artificial Intelligence: A Modern Approach. Harlow: Pearson Education.

Footnotes

  1. An atomic representation is one in which each state is treated as a black box with no internal structure, meaning the state either does or does not match what you’re looking for.

  2. Problem formulation is specifically about defining the formal components of a search problem (state space, initial state, goal state, actions, etc.) — it’s about translating a real-world problem into the specific mathematical structure needed for search algorithms.

  3. A state is a situation that an agent can find itself in.

  4. Expressed by a graph whose nodes are the set of all states, and whose links are actions that transform one state into another

  5. Can be a single goal state, a small set of alternative goal states, or a property that applies to many states (e.g., no dirt in any location)

  6. First-in-first-out (FIFO) queue first pops the node that was added to the queue first

  7. Last-in-first-out queue (LIFO; also known as a stack) pops first the most recently added node

  8. First pops the node with the minimum costs according to \(f(n)\)

  9. Reading recommendation

  10. The average utility is guided by the selection policy.

  11. For games with binary outcomes, average utility equals win percentage

  12. Playout policies bias the moves toward good ones. For Go and other games, playout policies have been successfully learned from self-play by using neural networks. Sometimes game-specific heuristics are also used (e.g., take the corner square in Othello)