Decision Trees

Introduction to AI (I2AI)

Andy Weeger

Neu-Ulm University of Applied Sciences

May 9, 2025

Introduction

Importance

Decision trees are a fundamental concept in knowledge-based AI agents as they …

  • represent a function that maps attribute values to a decision (i.e., transition model),
  • use search to find a decision through a sequence of tests, and
  • create a model that is inherently explainable.

They constitute a critical part of the AI toolkit by offering both predictive power and interpretability.

Recap: Structure

A decision tree is a representation of a function that maps a vector of attribute values to a single output value (i.e., a “decision”).

  • An internal node represents a test of a property
  • Branches are labeled with possible values of the test
  • Each leaf node specifies the value to be returned if that leaf is reached

In a Boolean decision tree, the input is a vector of input attributes \(X\) and the output is a single Boolean value \(y\).
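
To make this concrete, here is a minimal Python sketch of a hand-made Boolean decision tree (a hypothetical structure loosely inspired by the restaurant example below, not the tree from Figure 1): internal nodes test one attribute, branches are labeled with attribute values, and leaves hold the output value.

```python
# Sketch: a Boolean decision tree as nested attribute tests (hypothetical structure).
tree = {
    "attribute": "Patrons",
    "branches": {
        "None": False,                     # leaf: the decision is "don't wait"
        "Some": True,                      # leaf: the decision is "wait"
        "Full": {                          # internal node: another test
            "attribute": "Hungry",
            "branches": {"No": False, "Yes": True},
        },
    },
}

def decide(node, example):
    """Follow the attribute tests from the root until a leaf (a Boolean) is reached."""
    while isinstance(node, dict):
        node = node["branches"][example[node["attribute"]]]
    return node

print(decide(tree, {"Patrons": "Full", "Hungry": "Yes"}))  # True
```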

Example: Restaurant Decision

Consider deciding whether to wait for a table at a restaurant (Russell and Norvig 2022, 668).

The output \(y\) is a Boolean variable \(WillWait\).

The input \(x\) is a vector of ten attributes with discrete values:

  • Alternate – Is there an alternative? (T/F)
  • Bar – Does the restaurant have a bar to wait in? (T/F)
  • Fri – Is it Friday or Saturday? (T/F)
  • Hungry – Am I hungry? (T/F)
  • Patrons – How many guests are there? (none, some, full)
  • Price – How expensive is the food? (€, €€, €€€)
  • WaitEstimate – How long do we have to wait? (0-10, 10-30, 30-60, >60)
  • Reservation – Have I made a reservation? (T/F)
  • Raining – Is it raining outside? (T/F)
  • Type – What kind of restaurant is it? (French, Italian, Thai, Burger)

Example decision tree

Figure 1: Decision tree for the restaurant example, based on Russell and Norvig (2022, 674)

Underlying training set

| Example | Alt | Bar | Fri | Hun | Pat | Price | Rain | Res | Type | Est | WillWait |
|---|---|---|---|---|---|---|---|---|---|---|---|
| \(x_1\) | Yes | No | No | Yes | Some | €€€ | No | Yes | French | 0-10 | \(y_1 =\) Yes |
| \(x_2\) | Yes | No | No | Yes | Full | € | No | No | Thai | 30-60 | \(y_2 =\) No |
| \(x_3\) | No | Yes | No | No | Some | € | No | No | Burger | 0-10 | \(y_3 =\) Yes |
| \(x_4\) | Yes | No | Yes | Yes | Full | € | Yes | No | Thai | 10-30 | \(y_4 =\) Yes |
| \(x_5\) | Yes | No | Yes | No | Full | €€€ | No | Yes | French | >60 | \(y_5 =\) No |
| \(x_6\) | No | Yes | No | Yes | Some | €€ | Yes | Yes | Italian | 0-10 | \(y_6 =\) Yes |
| \(x_7\) | No | Yes | No | No | None | € | Yes | No | Burger | 0-10 | \(y_7 =\) No |
| \(x_8\) | No | No | No | Yes | Some | €€ | Yes | Yes | Thai | 0-10 | \(y_8 =\) Yes |
| \(x_9\) | No | Yes | Yes | No | Full | € | Yes | No | Burger | >60 | \(y_9 =\) No |
| \(x_{10}\) | Yes | Yes | Yes | Yes | Full | €€€ | No | Yes | Italian | 10-30 | \(y_{10} =\) No |
| \(x_{11}\) | No | No | No | No | None | € | No | No | Thai | 0-10 | \(y_{11} =\) No |
| \(x_{12}\) | Yes | Yes | Yes | Yes | Full | € | No | No | Burger | 30-60 | \(y_{12} =\) Yes |
Table 1: Examples for the restaurant domain

Decision-Tree Types

Decision trees can be structured in two ways:

Binary Decision Trees

Each node has exactly two branches (True/False)

  • For numeric features: “Is X ≤ threshold?” (e.g., “Pages Viewed ≤ 20?”)
  • For categorical features: “Is X = value?” (e.g., “Referrer = Slashdot?”)

Non-Binary Decision Trees

Categorical nodes can have multiple branches

  • Each branch corresponds to one possible value
  • Often visually simpler but computationally less efficient
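
As a small illustration (attribute names and values are made up), the two node types differ only in how a test routes an example to a branch:

```python
# Sketch: binary tests vs. a multi-way categorical test (illustrative names only).
example = {"PagesViewed": 12, "Referrer": "Slashdot", "Type": "Thai"}

# Binary nodes: each test is a single True/False question.
goes_left_numeric = example["PagesViewed"] <= 20            # "Pages Viewed <= 20?"
goes_left_categorical = example["Referrer"] == "Slashdot"   # "Referrer = Slashdot?"

# Non-binary node: one branch per possible categorical value.
branches = {"French": "subtree_french", "Italian": "subtree_italian",
            "Thai": "subtree_thai", "Burger": "subtree_burger"}
next_subtree = branches[example["Type"]]                    # follow the matching branch
print(goes_left_numeric, goes_left_categorical, next_subtree)
```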

Learning Decision Trees

Exercise

Given the following 2D data with three classes (red, green, blue), construct a decision tree with depth 3 that separates the classes as effectively as possible.

Figure 2: Training data points with three classes (red, green, blue)

Solution to the decision tree exercise

graph TD
    A[Root Node<br>x₁ ≤ 50?] -->|Yes| B[x₂ ≤ 50?]
    A -->|No| C[Class: Blue]
    
    B -->|Yes| D[Class: Red]
    B -->|No| E[Class: Green]
    
    classDef default font-family:Arial,font-size:14px;
    classDef redClass fill:#ffcccc,stroke:#ff0000,color:#cc0000,font-family:Arial;
    classDef greenClass fill:#ccffcc,stroke:#00cc00,color:#006600,font-family:Arial;
    classDef blueClass fill:#ccccff,stroke:#0000ff,color:#0000cc,font-family:Arial;
    
    class D redClass;
    class E greenClass;
    class C blueClass;

Inducing Trees

To get a naive solution, we could simply construct a tree with one path to a leaf for each example.

  • We test all the attributes along the path and attach the classification of the example to the leaf
  • This correctly classifies all examples but doesn’t generalize
  • It just memorizes the observations

How can we find a tree that is:

  1. Consistent with the training set, and
  2. As small as possible, to promote generalization?

It’s intractable to find the smallest consistent tree, but decision tree learning algorithms use greedy heuristics to efficiently find a reasonably small tree.

Divide-and-Conquer Strategy

Decision tree learning adopts a greedy divide-and-conquer approach:

  1. Select the most important attribute to test at the root
  2. Divide the training set into subsets corresponding to each value of that attribute
  3. Recursively apply the same process to each subset

The goal is to reach the correct classification with a small number of tests, creating a tree that is both accurate and shallow.

But how do we determine which attribute is most important?

Entropy

Entropy comes from information theory and measures the unpredictability of a random variable. In the context of decision trees, we’re interested in the entropy of the class label. When entropy is high, the classes are mixed. When entropy is zero, we have a pure node with only one class present.

Formal: For a random variable \(X\) with possible values \(V(X)\):

\[H(X) = - \sum\limits_{x \in V(X)} p_x \cdot \log_2 (p_x)\]

where \(p_x\) is the probability that \(X\) has value \(x\).

  • \(0 \leq H(X) \leq \log_2 (|V(X)|)\)¹
  • \(H(X) = 0\) means complete certainty
  • Maximum value occurs when all outcomes are equally likely
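
A few lines of Python make these properties tangible (a sketch, not part of the lecture material):

```python
from math import log2

def entropy(probabilities):
    """H(X) = -sum over x of p_x * log2(p_x); terms with p_x = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))        # 0.0 -> complete certainty
print(entropy([0.5, 0.5]))   # 1.0 -> maximum for two equally likely outcomes
print(entropy([0.25] * 4))   # 2.0 -> log2(4), the upper bound for four values
```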

Example

Bob is deciding whether to go skiing based on three factors:
snow nearby, weekend, and sunny day.

| Snow near | Weekend | Sun | Go Skiing |
|---|---|---|---|
| yes | yes | yes | yes |
| yes | yes | yes | yes |
| yes | yes | no | yes |
| yes | no | yes | yes |
| no | yes | yes | yes |
| no | yes | yes | yes |
| no | yes | yes | no |
| no | yes | no | no |
| no | no | yes | no |
| no | no | yes | no |
| no | no | no | no |
Table 2: Data from his past decisions

What is the entropy of the class label Go Skiing?

The entropy of the class label Go Skiing is:

\[H(D) = -(\frac{6}{11} \cdot \log_2(\frac{6}{11}) + \frac{5}{11} \cdot \log_2(\frac{5}{11})) = 0.994\]

The class label Go Skiing has probability 6/11 for yes and 5/11 for no. The resulting entropy is close to 1, indicating high uncertainty (the maximum entropy for a binary variable is 1).

Information Gain

Information gain measures how much the entropy decreases when we split the data based on a particular attribute.

Formal: for a dataset \(D\) and attribute \(X\):

\[G(D,X) = H(D) - \sum\limits_{x \in V(X)} \frac{|D_x|}{|D|} H(D_x)\]

where:

  • \(H(D)\) is the entropy of the original dataset
  • \(D_x\) is the subset of \(D\) where attribute \(X\) has value \(x\)
  • \(|D_x|\) is the size of subset \(D_x\)
  • \(H(D_x)\) is the entropy of subset \(D_x\)

The attribute with the highest information gain is the most important attribute.

Example

For Bob’s skiing data, let’s calculate the information gain for the Snow near attribute.

First, split the data based on “Snow near”:

  • \(D_{yes}\) (4 examples, all Go Skiing = yes, entropy = 0)
  • \(D_{no}\) (7 examples, 2 yes and 5 no, entropy = 0.863)

The information gain is:

\(G(D,\text{Snow near}) = 0.994 - \frac{4}{11} \cdot 0 - \frac{7}{11} \cdot 0.863 = 0.445\)

Similarly, we can calculate:

  • \(G(D,\text{Weekend}) = 0.150\)
  • \(G(D,\text{Sun}) = 0.049\)

Since “Snow near” has the highest information gain, we select it as the root node.
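
These gains can be reproduced with a short script over Table 2 (a sketch; column names are shortened):

```python
from collections import Counter
from math import log2

# Bob's past decisions from Table 2 as (snow_near, weekend, sun, go_skiing) tuples.
data = [
    ("yes", "yes", "yes", "yes"), ("yes", "yes", "yes", "yes"),
    ("yes", "yes", "no",  "yes"), ("yes", "no",  "yes", "yes"),
    ("no",  "yes", "yes", "yes"), ("no",  "yes", "yes", "yes"),
    ("no",  "yes", "yes", "no"),  ("no",  "yes", "no",  "no"),
    ("no",  "no",  "yes", "no"),  ("no",  "no",  "yes", "no"),
    ("no",  "no",  "no",  "no"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    gain = entropy([r[-1] for r in rows])
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

for name, idx in [("Snow near", 0), ("Weekend", 1), ("Sun", 2)]:
    print(f"G(D, {name}) = {information_gain(data, idx):.3f}")
# G(D, Snow near) = 0.445, G(D, Weekend) = 0.150, G(D, Sun) = 0.049
```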

Building the Tree

After selecting Snow near as our root, we continue the process recursively for each subset.

For \(D_{yes}\) (all examples have Go Skiing = yes):
This is a pure node, so we’re done with this branch

For \(D_{no}\) (mixed yes and no):

  • Calculate information gain for remaining attributes on this subset
  • Weekend has higher gain (0.292) than Sun (0.169)
  • Split on Weekend
  • Continue recursively …

%%| fig-cap: "Decision tree for Bob's skiing decisions"
%%| fig-width: 8
%%| fig-height: 6

graph TD
    A["Snow near?"] -->|Yes| B["Yes"]
    A -->|No| C["Weekend?"]
    C -->|Yes| D["Sun?"]
    C -->|No| E["No"]
    D -->|Yes| F["(2 Yes, 1 No)"]
    D -->|No| G["No"]
    
    %% Apply Arial font to all elements including edges
    classDef default font-family:Arial,font-size:14px;
    linkStyle default font-family:Arial,font-size:12px,fill:none,stroke-width:1px;
    
    %% Node styling
    classDef decision fill:#f9f9f9,stroke:#333,stroke-width:1px,font-family:Arial;
    classDef yes fill:#d4f4d4,stroke:#060,stroke-width:1px,color:#060,font-family:Arial;
    classDef no fill:#f4d4d4,stroke:#600,stroke-width:1px,color:#600,font-family:Arial;
    classDef mixed fill:#f4f4d4,stroke:#660,stroke-width:1px,color:#660,font-family:Arial;
    
    class A,C,D decision;
    class B,F yes;
    class E,G no;
    class F mixed;

Gini Impurity

Instead of information gain, the Gini impurity is another common criterion:

\[\text{Gini}(X) = \sum\limits_{x \in V(X)} p_x(1-p_x)\]

Where \(V(X)\) are the class values in node \(X\) and \(p_x\) is the probability of class \(x\).

Properties:

  • Measures the probability of misclassifying a randomly chosen element
  • 0 means all elements belong to the same class
  • Maximum value occurs when classes are equally likely
  • Often used in CART (Classification and Regression Trees) algorithm
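
A minimal sketch of the formula in Python:

```python
def gini(probabilities):
    """Gini(X) = sum over x of p_x * (1 - p_x), equivalently 1 - sum of p_x^2."""
    return 1 - sum(p * p for p in probabilities)

print(gini([1.0, 0.0]))     # 0.0   -> pure node, all elements in one class
print(gini([0.5, 0.5]))     # 0.5   -> maximum impurity for two classes
print(gini([6/11, 5/11]))   # ~0.496 for the skiing label (6 yes, 5 no)
```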

Numerical Attributes

For numerical attributes:

  1. Sort the values of the attribute in the training set
  2. Consider all possible thresholds between adjacent values
  3. Calculate information gain for each threshold
  4. Select the threshold with the highest information gain
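
A sketch of this procedure (hypothetical helper names, illustrative data): midpoints between adjacent sorted values serve as candidate thresholds, and the one with the highest information gain wins.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (information gain, threshold) for the best split 'value <= t'."""
    pairs = sorted(zip(values, labels))
    base, best = entropy(labels), (-1.0, None)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2                       # candidate threshold: the midpoint
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = max(best, (gain, t))
    return best

# Illustrative data (not from the exercises): a numeric attribute and a binary label.
print(best_threshold([20, 30, 40, 50, 60], ["no", "no", "yes", "yes", "yes"]))
# -> approximately (0.971, 35.0): splitting at 35 separates the classes perfectly
```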

Recursive Learning Process

In each recursive step of decision tree learning, there are four cases to consider:

  1. Mixed examples (positive and negative): Choose the best attribute and split
  2. Pure node (all examples have same class): Create a leaf node with that class
  3. No examples: Create a leaf with the majority class from the parent node
  4. No attributes left but mixed classes: Create a leaf with the majority class (handles noisy data)

This process continues until all branches end in leaf nodes.
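
The following sketch mirrors these four cases. It is a simplified illustration of the greedy divide-and-conquer scheme (hypothetical function names), not a reference implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(examples, attr):
    g = entropy([y for _, y in examples])
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        g -= len(subset) / len(examples) * entropy(subset)
    return g

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def learn_tree(examples, attributes, parent_examples=()):
    if not examples:                                          # case 3: no examples
        return majority([y for _, y in parent_examples])
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                                 # case 2: pure node
        return labels[0]
    if not attributes:                                        # case 4: attributes exhausted
        return majority(labels)
    best = max(attributes, key=lambda a: gain(examples, a))   # case 1: best split
    tree = {"attribute": best, "branches": {}}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        remaining = [a for a in attributes if a != best]
        tree["branches"][value] = learn_tree(subset, remaining, examples)
    return tree

# Tiny usage example with two of Bob's observations (attribute names abbreviated).
data = [({"snow": "yes", "weekend": "yes", "sun": "yes"}, "yes"),
        ({"snow": "no",  "weekend": "no",  "sun": "yes"}, "no")]
print(learn_tree(data, ["snow", "weekend", "sun"]))
# e.g. {'attribute': 'snow', 'branches': {'yes': 'yes', 'no': 'no'}}
```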

Preventing Overfitting

The Overfitting Problem

Decision trees face a fundamental trade-off between data fit and generalization.

  • A deeper tree can fit the training data more perfectly
  • But a tree that’s too deep might capture noise rather than true patterns
  • This results in poor generalization to new data

Pruning Techniques

Pruning reduces the size of decision trees to prevent overfitting.

Pre-pruning

Stop growing the tree while it is being built:

  • Limit the maximum tree depth
  • Require a minimum number of samples per node
  • Set a minimum information gain threshold

Post-pruning

Build the full tree, then remove sections:

  • Reduced error pruning — replace nodes with their most common class if it doesn’t increase error on a validation set
  • Cost-complexity pruning — balance accuracy against tree size using a penalty parameter

Both approaches improve generalization by reducing model complexity.
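
As one possible realization, scikit-learn exposes both ideas through hyperparameters. The snippet below is an illustrative sketch (dataset and parameter values are arbitrary), not part of the lecture material.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Pre-pruning: constrain the tree while it is grown.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             min_impurity_decrease=0.01).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity path, then pick the penalty (ccp_alpha)
# that gives the best accuracy on the validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
post = max((DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
            for a in path.ccp_alphas),
           key=lambda tree: tree.score(X_val, y_val))

print("pre-pruned depth:", pre.get_depth(), "| post-pruned depth:", post.get_depth())
```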

Performance Assessment

To properly evaluate a decision tree model:

1. Divide your data into separate sets

  • Training set (e.g., 70%) — used to build the tree
  • Validation set (e.g., 10%) — used for pruning decisions
  • Test set (e.g., 20%) — used only for final evaluation

2. Measure accuracy on the test set

3. Use cross-validation² for more robust assessment

As the training set grows, prediction quality usually increases, then plateaus.
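
A sketch of this protocol with scikit-learn (dataset and split sizes are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 70 % training, 10 % validation, 20 % test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=2/3, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))  # guides pruning decisions
print("test accuracy:", model.score(X_test, y_test))      # final, untouched estimate

# Cross-validation: accuracy averaged over multiple train/test splits.
print("5-fold CV accuracy:", cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=5).mean())
```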

Usage & Extensions

Real-World Applications

Decision trees are used across many domains:

  • Finance: Credit scoring, fraud detection
  • Healthcare: Disease diagnosis, treatment planning
  • Marketing: Customer segmentation, churn prediction
  • Operations: Quality control, maintenance scheduling

Their simplicity and interpretability make them particularly valuable when decisions need to be explained to stakeholders.

Ensemble Methods

Ensemble methods address the limitations of single decision trees by combining multiple trees.

  • Random Forests: Build many trees on random subsets of data and features, then average their predictions
  • Gradient Boosting: Build trees sequentially, with each tree correcting errors made by previous trees
  • AdaBoost: Weight samples based on classification difficulty

These ensemble approaches often dramatically outperform single decision trees while retaining some interpretability.
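
For orientation, a sketch comparing the three ensemble types with scikit-learn (dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Gradient boosting": GradientBoostingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each ensemble.
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```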

Strengths and Weaknesses

| Strengths | Weaknesses |
|---|---|
| Easy to understand and interpret | Can create overly complex trees that don’t generalize well |
| Minimal data preparation required | Small changes in data can result in a very different tree |
| Handles both numerical and categorical data | Biased toward attributes with more levels |
| Handles missing values well | Struggles with diagonal decision boundaries |
| Computationally inexpensive | Generally lower accuracy than ensemble methods |

Understanding these trade-offs helps in choosing when to use decision trees.

Summary

  • Decision trees represent functions by sequencing attribute tests
  • They excel at explainability but can struggle with certain function types
  • Tree learning algorithms use information gain to select the most informative attributes
  • Pruning techniques help prevent overfitting
  • While simple, decision trees form the foundation for powerful ensemble methods

Decision trees balance performance and interpretability, making them a valuable tool in any data scientist’s toolkit.

Divide-and-conquer

Create the decision tree by applying the divide-and-conquer approach on the restaurant examples.

Compare the naive tree with the tree obtained by applying the divide-and-conquer heuristic. What differences do you see?

Decision tree

Explain how decision trees can be used to create a learning agent. Relate your answers to the components outlined in the lecture notes.

We never test the same attribute twice along one path in a decision tree. Why not?

Information Gain

Consider the following dataset about weather conditions and whether tennis matches were played:

| Day | Outlook | Temperature | Humidity | Windy | Play Tennis |
|---|---|---|---|---|---|
| 1 | Sunny | Hot | High | No | No |
| 2 | Sunny | Hot | High | Yes | No |
| 3 | Overcast | Hot | High | No | Yes |
| 4 | Rain | Mild | High | No | Yes |
| 5 | Rain | Cool | Normal | No | Yes |
| 6 | Rain | Cool | Normal | Yes | No |
| 7 | Overcast | Cool | Normal | Yes | Yes |
| 8 | Sunny | Mild | High | No | No |
| 9 | Sunny | Cool | Normal | No | Yes |
| 10 | Rain | Mild | Normal | No | Yes |
| 11 | Sunny | Mild | Normal | Yes | Yes |
| 12 | Overcast | Mild | High | Yes | Yes |
| 13 | Overcast | Hot | Normal | No | Yes |
| 14 | Rain | Mild | High | Yes | No |
  1. Calculate the entropy of the Play Tennis attribute for the entire dataset.
  2. Calculate the information gain for each of the four attributes (Outlook, Temperature, Humidity, Windy).
  3. Which attribute should be selected as the root node of the decision tree?
  4. Draw the first level of the decision tree.
  5. For the Outlook = Sunny branch, calculate which attribute should be tested next.

Hint

For the entropy calculation, count how many Yes and No instances there are in the Play Tennis column. Remember that the entropy formula is:

\[H(X) = - \sum\limits_{x \in V(X)} p_x \cdot \log_2 (p_x)\]

For information gain, you’ll need to split the data based on each attribute value and calculate the weighted entropy.

Numeric Attributes

Consider the following dataset for predicting credit risk based on income and debt levels:

| Customer ID | Income (1000€) | Debt (1000€) | Credit Risk |
|---|---|---|---|
| 1 | 45 | 10 | Low |
| 2 | 32 | 12 | Low |
| 3 | 85 | 15 | Low |
| 4 | 38 | 20 | High |
| 5 | 48 | 28 | High |
| 6 | 29 | 18 | High |
| 7 | 56 | 5 | Low |
| 8 | 22 | 10 | High |
| 9 | 70 | 8 | Low |
| 10 | 35 | 25 | High |
  1. List all potential splitting thresholds that should be considered for the Income attribute.
  2. Calculate the information gain for the threshold Income ≤ 40.
  3. List all potential splitting thresholds for the Debt attribute.
  4. Calculate the information gain for the threshold Debt ≤ 15.
  5. Which numerical split would be chosen for the root node of the decision tree?

Hint

For numerical attributes, consider threshold values that are midpoints between adjacent values in the sorted list of values. For example, if you have values 10, 15, and 20, the potential thresholds would be 12.5 and 17.5.

Decision Boundaries

Consider a two-dimensional feature space with two attributes: x₁ and x₂. The following points represent different classes:

  • Class A: (2,3), (3,2), (3,3), (4,3)
  • Class B: (1,1), (2,1), (2,2)
  • Class C: (4,1), (4,2), (5,1), (5,2)
  1. Visualize these points in a 2D coordinate system.
  2. Construct a decision tree with a maximum depth of 2 (counting the root as depth 0) that separates the classes as well as possible.
  3. Draw the decision boundaries created by your tree on the 2D plot.
  4. What is the classification accuracy of your tree on the training data?
  5. If a new point (3,1) is encountered, how would your tree classify it?

Hint

Start by identifying the best horizontal or vertical split that separates the classes. Then, apply the same process to each resulting region. The decision boundaries will form axis-parallel lines in the 2D space.

Pruning

Consider the following decision tree trained on a small dataset:

graph TD
    A[Age ≤ 30?] -->|Yes| B[Income ≤ 40K?]
    A -->|No| C[Income ≤ 60K?]
    
    B -->|Yes| D[Buy = No<br>3/4 correct]
    B -->|No| E[Education?]
    
    C -->|Yes| F[Education?]
    C -->|No| G[Buy = Yes<br>5/6 correct]
    
    E -->|High| H[Buy = Yes<br>2/2 correct]
    E -->|Low| I[Buy = No<br>1/1 correct]
    
    F -->|High| J[Buy = Yes<br>3/4 correct]
    F -->|Low| K[Buy = No<br>2/3 correct]

Unpruned decision tree

The tree makes some errors on the training data, as indicated by the fractions (e.g., “3/4 correct” means the node correctly classifies 3 out of 4 training examples that reach that node).

You also have a validation set with the following distribution:

| Node | Validation examples | Correctly classified |
|---|---|---|
| D | 8 | 5 |
| H | 3 | 2 |
| I | 2 | 1 |
| J | 6 | 3 |
| K | 5 | 3 |
| G | 10 | 7 |
  1. Calculate the classification accuracy of the unpruned tree on the validation set.
  2. Apply reduced error pruning to the tree. Identify which nodes, if any, should be pruned.
  3. Draw the resulting pruned tree.
  4. Calculate the classification accuracy of the pruned tree on the validation set.
  5. Explain why pruning improved or worsened the performance.

Hint

In reduced error pruning, replace a node with its most common class if doing so doesn’t decrease accuracy on the validation set. Start with the leaf nodes and work your way up. For each non-leaf node, calculate the accuracy before and after replacing it with a leaf.

Gini Index vs. Entropy

Consider the following dataset for customer churn prediction:

| Customer | Age | Usage | Contract Length | Churn |
|---|---|---|---|---|
| 1 | Young | Low | Month-to-month | Yes |
| 2 | Young | Medium | Year | No |
| 3 | Young | High | Month-to-month | No |
| 4 | Middle | Low | Month-to-month | Yes |
| 5 | Middle | Medium | Year | Yes |
| 6 | Middle | High | Month-to-month | No |
| 7 | Senior | Low | Month-to-month | Yes |
| 8 | Senior | Medium | Year | No |
| 9 | Senior | High | Year | No |
| 10 | Senior | High | Month-to-month | Yes |
  1. Calculate the Gini index for the entire dataset with respect to the Churn attribute.
  2. For the Contract Length attribute, calculate:
    1. The Gini index for each split
    2. The weighted Gini index after splitting
    3. The Gini gain (reduction in impurity)
  3. For the same Contract Length attribute, calculate:
    1. The entropy for each split
    2. The weighted entropy after splitting
    3. The information gain
  4. Which measure (Gini or entropy) indicates Contract Length is more useful for splitting?
  5. Explain a scenario where Gini index and entropy might lead to different tree structures.

Hint

The Gini index for a node with class distribution \((p_1, p_2, ..., p_k)\) is calculated as:

\[Gini(D) = 1 - \sum_{i=1}^{k} p_i^2\]

For a binary classification like “Churn” (Yes/No), this simplifies to:

\[Gini(D) = 1 - (p_{yes}^2 + p_{no}^2)\]

The Gini gain is similar to information gain: it’s the difference between the parent node’s Gini index and the weighted average of the children nodes’ Gini indices.

Literature

Russell, Stuart, and Peter Norvig. 2022. Artificial Intelligence: A Modern Approach. Harlow: Pearson Education.

Footnotes

  1. \(\log_2 (|V(X)|)\) describes the upper bound of entropy — the maximum possible entropy for a random variable \(X\) is \(\log_2 (|V(X)|)\)

  2. Multiple train-test splits are created and averaged.