🧠 Introduction to AI
Neu-Ulm University of Applied Sciences
February 12, 2024
Agents that plan ahead by considering a sequence of actions that form a path to a goal state are called problem-solving agents (Russel and Norvig 2022, 81).
Here only simple environments are considered (episodic, single agent, fully observable, deterministic, static, discrete, and known).
We assume that information about the environment is given (e.g., a map).
In simple environments, agents can follow a four-phase problem-solving process (Russel and Norvig 2022, 81–82):
A search problem can be defined formally as (Russel and Norvig 2022, 83):
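As a hedged sketch, the standard components of this formal definition (initial state, applicable actions, transition model, goal test, and action costs) might be expressed as a small Python interface. The class and method names below are illustrative, not taken from the text:

```python
class SearchProblem:
    """Illustrative interface for the formal definition of a search problem."""

    def initial_state(self):
        """The state the agent starts in."""
        raise NotImplementedError

    def actions(self, state):
        """The finite set of actions applicable in `state`."""
        raise NotImplementedError

    def result(self, state, action):
        """Transition model: the state reached by doing `action` in `state`."""
        raise NotImplementedError

    def is_goal(self, state):
        """Goal test: True if `state` is a goal state."""
        raise NotImplementedError

    def action_cost(self, state, action, next_state):
        """Numeric cost of applying `action` in `state` to reach `next_state`."""
        return 1
```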
Search algorithms take a search problem as input and return a solution, or an indication of failure (Russel and Norvig 2022, 89).
They can implement
Greedy best-first search uses the evaluation function \(f(n) = h(n)\).
It expands first the node in the queue with the lowest \(h(n)\) value.
In route finding problems, the straight-line distance heuristic is used as \(h(n)\).
Greedy best-first graph search is complete in finite state spaces, but not in infinite ones.
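A minimal sketch of greedy best-first graph search, assuming the SearchProblem-style interface sketched above and a heuristic function `h`; the priority queue is ordered by \(h(n)\) alone:

```python
import heapq
from itertools import count


def greedy_best_first_search(problem, h):
    """Expand the node with the lowest h(n) first (graph search with a reached set)."""
    start = problem.initial_state()
    tie = count()                               # tie-breaker so states are never compared
    frontier = [(h(start), next(tie), start, [start])]
    reached = {start}
    while frontier:
        _, _, state, path = heapq.heappop(frontier)
        if problem.is_goal(state):
            return path                         # solution: path from start to goal
        for action in problem.actions(state):
            child = problem.result(state, action)
            if child not in reached:
                reached.add(child)
                heapq.heappush(frontier, (h(child), next(tie), child, path + [child]))
    return None                                 # failure
```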
The A* search (pronounced “A-star search”) uses the evaluation function \(f(n) = g(n) + h(n)\), where \(g(n)\) is the path cost from the initial state to node \(n\) and \(h(n)\) is the estimated cost of the cheapest path from \(n\) to a goal state.
A* search is complete; whether it is cost-optimal depends on the heuristic.
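Ordering the same frontier by \(f(n) = g(n) + h(n)\) gives a hedged sketch of A*; the substantive change is that path costs \(g\) are tracked, and a cheaper rediscovered path replaces an earlier one:

```python
import heapq
from itertools import count


def a_star_search(problem, h):
    """Best-first graph search ordered by f(n) = g(n) + h(n)."""
    start = problem.initial_state()
    tie = count()
    frontier = [(h(start), next(tie), 0, start, [start])]   # (f, tie, g, state, path)
    best_g = {start: 0}
    while frontier:
        _, _, g, state, path = heapq.heappop(frontier)
        if problem.is_goal(state):
            return path
        for action in problem.actions(state):
            child = problem.result(state, action)
            g_child = g + problem.action_cost(state, action, child)
            if g_child < best_g.get(child, float("inf")):   # keep only the cheapest known path
                best_g[child] = g_child
                f_child = g_child + h(child)
                heapq.heappush(frontier, (f_child, next(tie), g_child, child, path + [child]))
    return None
```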
The agents considered so far use offline search algorithms: they compute a complete solution before taking their first action. Such algorithms are not helpful in unknown environments.
Online search agents interleave computation and action:
These agents can discover successors only for a state that they currently occupy or that they have already learned about (i.e., that is contained in a map created online).
Online search is a good idea in dynamic or semi-dynamic environments.
AlphaGo is a computer program that plays the board game Go, which is considered much more difficult for computers to win than other games such as chess because its strategic and aesthetic nature makes it difficult to construct a direct scoring function. AlphaGo was developed by the London-based DeepMind Technologies, an acquired subsidiary of Alphabet Inc.
AlphaGo uses a Monte Carlo tree search algorithm to find its moves, based on knowledge previously acquired through machine learning, specifically by an artificial neural network (a deep learning method) trained extensively on both human and computer games. A neural network is trained to recognize the best moves and the winning rates of those moves. This neural network improves the strength of the tree search, leading to stronger move selection in the next iteration.
In competitive environments, two or more agents have conflicting goals.
This gives rise to adversarial search problems.
The AI community is particularly interested in games of a simplified nature (e.g., chess, go, and poker), as in such games
The games most commonly studied within AI are deterministic, two-player, turn-taking, fully observable, zero-sum games (Russel and Norvig 2022, 192).
Possible formalization
The initial state, \(\textrm{ACTIONS}(s)\), and \(\textrm{RESULT}(s,a)\) define the state space graph, where the vertices are states, the edges are moves, and a state might be reached by multiple paths.
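A hedged sketch of such a formalization as a Python interface; in addition to the initial state, \(\textrm{ACTIONS}(s)\), and \(\textrm{RESULT}(s,a)\), it includes the usual TO-MOVE, IS-TERMINAL, and UTILITY elements (all names are illustrative):

```python
class Game:
    """Illustrative interface for a deterministic, turn-taking, two-player game."""

    def initial_state(self):
        """The state the game starts in."""
        raise NotImplementedError

    def to_move(self, state):
        """The player whose turn it is to move in `state` (MAX or MIN)."""
        raise NotImplementedError

    def actions(self, state):
        """The legal moves in `state`."""
        raise NotImplementedError

    def result(self, state, action):
        """Transition model: the state resulting from `action` in `state`."""
        raise NotImplementedError

    def is_terminal(self, state):
        """True if the game is over in `state`."""
        raise NotImplementedError

    def utility(self, state, player):
        """Numeric payoff for `player` in a terminal `state` (e.g., +1, 0, -1)."""
        raise NotImplementedError
```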
Players (here MIN and MAX) alternate turns and need to have a conditional plan—a contingent strategy specifying a response to each move of the opponent.
Given a state-space search tree, each node’s minimax value is calculated.
The minimax value is the best achievable utility of being in a given state (against a rational adversary).
MAX prefers to move to a state of maximum value; MIN prefers to move to a state of minimum value.
The △ nodes are “MAX nodes”, in which it is MAX’s turn to move; the ▽ nodes are “MIN nodes”. MAX’s best move at the root is α₁ (the highest minimax value), MIN’s best move is β₁ (the lowest minimax value).
The minimax algorithm performs a complete depth-first exploration of the game tree (Russel and Norvig 2022, 196).
The exponential complexity makes the minimax algorithm impractical for complex games (even with alpha-beta pruning applied; the chess game tree is larger than the number of atoms in the universe).
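A minimal recursive sketch of this exploration over the Game-style interface above, assuming strictly alternating turns with MAX to move at the root (illustrative code, not the book’s pseudocode):

```python
def minimax_decision(game, state):
    """Choose the action leading to the successor with the highest minimax value."""
    player = game.to_move(state)  # the root player, assumed to be MAX
    return max(game.actions(state),
               key=lambda a: min_value(game, game.result(state, a), player))


def max_value(game, state, player):
    """Minimax value of a state in which it is MAX's turn to move."""
    if game.is_terminal(state):
        return game.utility(state, player)
    return max(min_value(game, game.result(state, a), player)
               for a in game.actions(state))


def min_value(game, state, player):
    """Minimax value of a state in which it is MIN's turn to move."""
    if game.is_terminal(state):
        return game.utility(state, player)
    return min(max_value(game, game.result(state, a), player)
               for a in game.actions(state))
```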
Pruning reduces computational complexity (Russel and Norvig 2022, 198–99).
Pruning stops the search at the moment when it is determined that the value of a subtree is worse than the best solution already identified.
The general principle is as follows: consider a node n somewhere in the tree, such that Player has a choice of moving to n. If Player has a better choice either at the same level or at any point higher up in the tree, then Player will never move to n. So as soon as enough about n has been found out (by examining some of its descendants) to reach this conclusion, it can be pruned (Russel and Norvig 2022, 198).
Alpha-beta pruning gets its name from the two extra parameters in MAX-VALUE(state,α,β) that describe the bounds on the backed-up values that appear anywhere along the path:
α = the value of the best (i.e., highest-value) choice for MAX found so far (“at least”)
β = the value of the best (i.e., lowest-value) choice for MIN found so far (“at most”)
Alpha-beta search updates the values of α and β as it goes along and prunes the remaining branches at a node as soon as the value of the current node is known to be worse than the current α or β value for MAX or MIN, respectively.
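A hedged sketch of the same recursion with alpha-beta pruning added: α (“at least”) and β (“at most”) are passed down, and a branch is cut as soon as its value is known to fall outside the current [α, β] window (again illustrative code over the Game-style interface):

```python
import math


def alpha_beta_search(game, state):
    """Best move for the player to move at `state` (assumed to be MAX)."""
    player = game.to_move(state)
    best_action, alpha, beta = None, -math.inf, math.inf
    for a in game.actions(state):
        v = ab_min_value(game, game.result(state, a), player, alpha, beta)
        if v > alpha:
            best_action, alpha = a, v
    return best_action


def ab_max_value(game, state, player, alpha, beta):
    if game.is_terminal(state):
        return game.utility(state, player)
    v = -math.inf
    for a in game.actions(state):
        v = max(v, ab_min_value(game, game.result(state, a), player, alpha, beta))
        if v >= beta:            # MIN higher up already has a better (lower) option: prune
            return v
        alpha = max(alpha, v)
    return v


def ab_min_value(game, state, player, alpha, beta):
    if game.is_terminal(state):
        return game.utility(state, player)
    v = math.inf
    for a in game.actions(state):
        v = min(v, ab_max_value(game, game.result(state, a), player, alpha, beta))
        if v <= alpha:           # MAX higher up already has a better (higher) option: prune
            return v
        beta = min(beta, v)
    return v
```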
What is pruning?
Read the note about pruning (and consult Russel and Norvig (2022) or the internet if necessary).
Explain in your own words under what conditions a subtree is skipped when using Alpha-beta pruning.
Draw an example (game search tree, 3 levels depth).
Monte Carlo tree search (MCTS) estimates the value of a state as the average utility over a number of simulations of complete games starting from the current state (Russel and Norvig 2022, 207–9).
Each iteration follows four steps:
This repeats for a set number of iterations or until the available time is up. At the end, the successor with the highest average utility is chosen.
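A high-level sketch of this loop; the four phases are passed in as functions, and tree nodes are assumed to carry `children`, `utility`, and `playouts` attributes (all names illustrative):

```python
def monte_carlo_tree_search(root, select, expand, simulate, back_propagate, iterations=1000):
    """Generic MCTS loop: repeat selection, expansion, simulation, back-propagation."""
    for _ in range(iterations):
        leaf = select(root)              # 1. selection: follow the selection policy to a leaf
        child = expand(leaf)             # 2. expansion: add a new child node to the tree
        result = simulate(child)         # 3. simulation: play out a complete game from child
        back_propagate(child, result)    # 4. back-propagation: update counts up to the root
    # finally, pick the successor of the root with the highest average utility
    return max(root.children, key=lambda n: n.utility / n.playouts)
```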
To what generic function of agents does MCTS relate?
It is the utility function, which measures an agent’s preferences among states of the world.
What should the selection process look like?
Figure 5 shows a tree with the root representing a state where P-A has won 37 of the 100 playouts done so far.
P-A has just moved to the root node.
P-B selects a move to a node where it has won 60/79 playouts; this is the best win percentage among the available moves.
P-A will select a move to a node where it has won 16/53 playouts (assuming it plays optimally).
P-B then continues on to the leaf node marked 27/35.
It would also have been reasonable to select the 2/11 node for the sake of exploration—with only 11 playouts, the node still has high uncertainty in its valuation, and might end up being the best option if more information about it is gained. So it makes sense to use a selection policy that balances exploitation and exploration.
Upper confidence bounds applied to trees (UCT) is a very effective selection policy ranking possible moves based on an upper confidence bound formula (UCB1)
\[ \textrm{UCB1}(n) = \frac{\textrm{U}(n)}{\textrm{N}(n)} + C * \sqrt{\frac{\log{\textrm{N}(\textrm{PARENT}(n))}}{\textrm{N}(n)}} \]
With \(C=1.4\), the 60/79 node in Figure 5 has the highest UCB1 score, but with \(C=1.5\), it would be the 2/11 node.
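A small numeric check of that claim, assuming the root’s 100 playouts as \(\textrm{N}(\textrm{PARENT}(n))\) and natural logarithms:

```python
import math


def ucb1(wins, playouts, parent_playouts, c):
    """UCB1(n) = U(n)/N(n) + C * sqrt(log N(PARENT(n)) / N(n))."""
    return wins / playouts + c * math.sqrt(math.log(parent_playouts) / playouts)


for c in (1.4, 1.5):
    print(c, round(ucb1(60, 79, 100, c), 3), round(ucb1(2, 11, 100, c), 3))
# C = 1.4 -> 1.098 for the 60/79 node vs. 1.088 for the 2/11 node (60/79 is selected)
# C = 1.5 -> 1.122 for the 60/79 node vs. 1.152 for the 2/11 node (2/11 is selected)
```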
Utility of MCTS
The conventional wisdom has been that Monte Carlo search has an advantage over heuristic alpha-beta tree search (not discussed here; see Russel and Norvig 2022, 202ff) for games where the branching factor is very high (and thus alpha-beta cannot search deep enough), or when it is difficult to define a good evaluation function.
Figure 6 shows a tree where a new child of the selected node is generated and marked with 0/0 (expansion).
The result of the simulation is used to update all the search tree nodes going up to the root.
P-B's nodes are incremented in both the number of wins and the number of playouts.
P-A's nodes are incremented in the number of playouts only.
Search operates over models of the world (which might be observed online).
Problem formulation must follow goal formulation.
Discuss that statement. Is it true or false? Why?
Give a complete problem formulation for each of the following problems. Choose a formulation that is precise enough to be implemented.
Your goal is to navigate a robot out of a maze. It starts in the center of the maze facing north. You can turn the robot to face north, east, south, or west, and you can direct the robot to move forward a certain distance (it will stop before hitting a wall).
Open only if you need help.
Explain in your own words the following terms:
Explain whether the MINIMAX algorithm is complete and optimal.
Can it be beaten by an opponent playing suboptimally? Why (not)?
Come up with a game tree in which MAX will beat a suboptimal MIN.
Open only if you need help.
In two-player, discrete, deterministic, turn-taking, zero-sum games with perfect information, the MINIMAX algorithm can select optimal moves by a depth-first enumeration of the game tree; the algorithm is also guaranteed to find a solution when there is one.
The algorithm performs a complete depth-first exploration of the game tree. If the maximum depth of the tree is \(m\) and there are \(b\) legal moves at each point, then the time complexity is \(O(b^m)\). The space complexity is \(O(bm)\) for an algorithm that generates all actions at once, or \(O(m)\) for an algorithm that generates actions one at a time. The exponential complexity makes MINIMAX impractical for complex games. MINIMAX does, however, serve as a basis for the mathematical analysis of games. By approximating the minimax analysis in various ways, we can derive more practical algorithms.
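As a rough worked example of that exponential growth (assuming the commonly cited chess figures of roughly \(b \approx 35\) legal moves per position and a game length of roughly \(m \approx 80\) plies, which are not given in the text above):

\[ b^m \approx 35^{80} \approx 10^{123} \gg 10^{80} \approx \textrm{number of atoms in the observable universe} \]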
If MIN does not play optimally, then MAX will do at least as well as it would against an optimal player, possibly better.
Read the note about pruning (and consult Russel and Norvig (2022) if necessary).
Explain in your own words under what conditions a subtree is skipped when using Alpha-beta pruning.
Draw an example (game search tree, 3 levels depth).
An atomic representation is one in which each state is treated as a black box with no internal structure, meaning the state either does or does not match what you’re looking for.
A state is a situation that an agent can find itself in.
Expressed by a graph whose nodes are the set of all states, and whose links are actions that transform one state into another
Can be a single goal state, a small set of alternative goal states, or a property that applies to many states (e.g., no dirt in any location)
first-in-first-out queue first pops the node that was added to the queue first
last-in-first-out queue (also known as a stack) pops first the most recently added node
a priority queue first pops the node with the minimum cost according to \(f(n)\)
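An illustrative Python sketch of these three frontier data structures:

```python
import heapq
from collections import deque

# FIFO queue: pops the node that was added first
fifo = deque(["a", "b", "c"])
assert fifo.popleft() == "a"

# LIFO queue (stack): pops the most recently added node
lifo = ["a", "b", "c"]
assert lifo.pop() == "c"

# priority queue: pops the node with the minimum f(n) value
prio = [(5, "a"), (2, "b"), (7, "c")]
heapq.heapify(prio)
assert heapq.heappop(prio) == (2, "b")
```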
Zero-sum means that what is good for one player is just as bad for the other: there is no “win-win” outcome
States where the game has ended are called terminal states
The search tree consists of AND nodes and OR nodes. AND nodes reflect the environment’s choice of action (which all need to be considered), OR nodes the agent’s own choices in each state (where only one needs to be considered). AND and OR nodes alternate. A solution is a conditional plan that considers every nondeterministic outcome and makes a plan for each one.
guided by the selection policy
For games with binary outcomes, average utility equals win percentage
A heuristic evaluation function returns an estimate of the expected utility of state \(s\) to player \(p\).
A playout policy biases the moves toward good ones. For Go and other games, playout policies have been successfully learned from self-play by using neural networks. Sometimes game-specific heuristics are also used (e.g., take the corner square in Othello)