Decision Trees

Introduction to AI (I2AI)

Andy Weeger

Neu-Ulm University of Applied Sciences

May 9, 2025

Introduction

Importance

Decision trees are a fundamental concept in knowledge-based AI agents as they …

  • represent a function that maps attribute values to a decision (i.e., transition model),
  • use search to find a decision through a sequence of tests, and
  • create a model that is inherently explainable.

They constitute a critical part of the AI toolkit by offering both predictive power and interpretability.

Recap: Structure

A decision tree is a representation of a function that maps a vector of attribute values to a single output value (i.e., a “decision”).

  • An internal node represents a test of a property
  • Branches are labeled with possible values of the test
  • Each leaf node specifies the value to be returned if that leaf is reached

In a Boolean decision tree, the input is a vector of input attributes \(X\) and the output is a single Boolean value \(y\).
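
To make this concrete, here is a minimal Python sketch of a hand-made Boolean decision tree (a hypothetical structure loosely inspired by the restaurant example below, not the tree from Figure 1): internal nodes test one attribute, branches are labeled with attribute values, and leaves hold the output value.

```python
# Sketch: a Boolean decision tree as nested attribute tests (hypothetical structure).
tree = {
    "attribute": "Patrons",
    "branches": {
        "None": False,                     # leaf: the decision is "don't wait"
        "Some": True,                      # leaf: the decision is "wait"
        "Full": {                          # internal node: another test
            "attribute": "Hungry",
            "branches": {"No": False, "Yes": True},
        },
    },
}

def decide(node, example):
    """Follow the attribute tests from the root until a leaf (a Boolean) is reached."""
    while isinstance(node, dict):
        node = node["branches"][example[node["attribute"]]]
    return node

print(decide(tree, {"Patrons": "Full", "Hungry": "Yes"}))  # True
```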

Example: Restaurant Decision

Consider deciding whether to wait for a table at a restaurant (Russell and Norvig 2022, 668).

The output \(y\) is a Boolean variable \(WillWait\).

The input \(x\) is a vector of ten attributes with discrete values:

  • Alternate – Is there an alternative? (T/F)
  • Bar – Does the restaurant have a bar to wait in? (T/F)
  • Fri – Is it Friday or Saturday? (T/F)
  • Hungry – Am I hungry? (T/F)
  • Patrons – How many guests are there? (none, some, full)
  • Price – How expensive is the food? (€, €€, €€€)
  • WaitEstimate – How long do we have to wait? (0-10, 10-30, 30-60, >60)
  • Reservation – Have I made a reservation? (T/F)
  • Raining – Is it raining outside? (T/F)
  • Type – What kind of restaurant is it? (French, Italian, Thai, Burger)

Example decision tree

Figure 1: Decision tree for the restaurant example, based on Russell and Norvig (2022, 674)

Underlying training set

| Example | Alt | Bar | Fri | Hun | Pat | Price | Rain | Res | Type | Est | WillWait |
|---|---|---|---|---|---|---|---|---|---|---|---|
| \(x_1\) | Yes | No | No | Yes | Some | €€€ | No | Yes | French | 0-10 | \(y_1 =\) Yes |
| \(x_2\) | Yes | No | No | Yes | Full | € | No | No | Thai | 30-60 | \(y_2 =\) No |
| \(x_3\) | No | Yes | No | No | Some | € | No | No | Burger | 0-10 | \(y_3 =\) Yes |
| \(x_4\) | Yes | No | Yes | Yes | Full | € | Yes | No | Thai | 10-30 | \(y_4 =\) Yes |
| \(x_5\) | Yes | No | Yes | No | Full | €€€ | No | Yes | French | >60 | \(y_5 =\) No |
| \(x_6\) | No | Yes | No | Yes | Some | €€ | Yes | Yes | Italian | 0-10 | \(y_6 =\) Yes |
| \(x_7\) | No | Yes | No | No | None | € | Yes | No | Burger | 0-10 | \(y_7 =\) No |
| \(x_8\) | No | No | No | Yes | Some | €€ | Yes | Yes | Thai | 0-10 | \(y_8 =\) Yes |
| \(x_9\) | No | Yes | Yes | No | Full | € | Yes | No | Burger | >60 | \(y_9 =\) No |
| \(x_{10}\) | Yes | Yes | Yes | Yes | Full | €€€ | No | Yes | Italian | 10-30 | \(y_{10} =\) No |
| \(x_{11}\) | No | No | No | No | None | € | No | No | Thai | 0-10 | \(y_{11} =\) No |
| \(x_{12}\) | Yes | Yes | Yes | Yes | Full | € | No | No | Burger | 30-60 | \(y_{12} =\) Yes |
Table 1: Examples for the restaurant domain

Decision-Tree Types

Decision trees can be structured in two ways:

Binary Decision Trees

Each node has exactly two branches (True/False)

  • For numeric features: “Is X ≤ threshold?” (e.g., “Pages Viewed ≤ 20?”)
  • For categorical features: “Is X = value?” (e.g., “Referrer = Slashdot?”)

Non-Binary Decision Trees

Categorical nodes can have multiple branches

  • Each branch corresponds to one possible value
  • Often visually simpler but computationally less efficient
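
As a small illustration (attribute names and values are made up), the two node types differ only in how a test routes an example to a branch:

```python
# Sketch: binary tests vs. a multi-way categorical test (illustrative names only).
example = {"PagesViewed": 12, "Referrer": "Slashdot", "Type": "Thai"}

# Binary nodes: each test is a single True/False question.
goes_left_numeric = example["PagesViewed"] <= 20            # "Pages Viewed <= 20?"
goes_left_categorical = example["Referrer"] == "Slashdot"   # "Referrer = Slashdot?"

# Non-binary node: one branch per possible categorical value.
branches = {"French": "subtree_french", "Italian": "subtree_italian",
            "Thai": "subtree_thai", "Burger": "subtree_burger"}
next_subtree = branches[example["Type"]]                    # follow the matching branch
print(goes_left_numeric, goes_left_categorical, next_subtree)
```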

Learning Decision Trees

Exercise

Given the following 2D data with three classes (red, green, blue), construct a decision tree with depth 3 that separates the classes as effectively as possible.

Figure 2: Training data points with three classes (red, green, blue)

Solution to the decision tree exercise

graph TD
    A[Root Node<br>x₁ ≤ 50?] -->|Yes| B[x₂ ≤ 50?]
    A -->|No| C[Class: Blue]
    
    B -->|Yes| D[Class: Red]
    B -->|No| E[Class: Green]
    
    classDef default font-family:Arial,font-size:14px;
    classDef redClass fill:#ffcccc,stroke:#ff0000,color:#cc0000,font-family:Arial;
    classDef greenClass fill:#ccffcc,stroke:#00cc00,color:#006600,font-family:Arial;
    classDef blueClass fill:#ccccff,stroke:#0000ff,color:#0000cc,font-family:Arial;
    
    class D redClass;
    class E greenClass;
    class C blueClass;

Inducing Trees

To get a naive solution, we could simply construct a tree with one path to a leaf for each example.

  • We test all the attributes along the path and attach the classification of the example to the leaf
  • This correctly classifies all examples but doesn’t generalize
  • It just memorizes the observations

How can we find a tree that is:

  1. Consistent with the training set, and
  2. As small as possible, to promote generalization?

It’s intractable to find the smallest consistent tree, but decision tree learning algorithms use greedy heuristics to efficiently find a reasonably small tree.

Divide-and-Conquer Strategy

Decision tree learning adopts a greedy divide-and-conquer approach:

  1. Select the most important attribute to test at the root
  2. Divide the training set into subsets corresponding to each value of that attribute
  3. Recursively apply the same process to each subset

The goal is to reach the correct classification with a small number of tests, creating a tree that is both accurate and shallow.

But how do we determine which attribute is most important?

Entropy

Entropy comes from information theory and measures the unpredictability of a random variable. In the context of decision trees, we’re interested in the entropy of the class label. When entropy is high, the classes are mixed. When entropy is zero, we have a pure node with only one class present.

Formal: For a random variable \(X\) with possible values \(V(X)\):

\[H(X) = - \sum\limits_{x \in V(X)} p_x \cdot \log_2 (p_x)\]

where \(p_x\) is the probability that \(X\) has value \(x\).

  • \(0 \leq H(X) \leq \log_2 (|V(X)|)\)¹
  • \(H(X) = 0\) means complete certainty
  • Maximum value occurs when all outcomes are equally likely
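
A few lines of Python make these properties tangible (a sketch, not part of the lecture material):

```python
from math import log2

def entropy(probabilities):
    """H(X) = -sum over x of p_x * log2(p_x); terms with p_x = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))        # 0.0 -> complete certainty
print(entropy([0.5, 0.5]))   # 1.0 -> maximum for two equally likely outcomes
print(entropy([0.25] * 4))   # 2.0 -> log2(4), the upper bound for four values
```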

Example

Bob is deciding whether to go skiing based on three factors:
snow nearby, weekend, and sunny day.

| Snow near | Weekend | Sun | Go Skiing |
|---|---|---|---|
| yes | yes | yes | yes |
| yes | yes | yes | yes |
| yes | yes | no | yes |
| yes | no | yes | yes |
| no | yes | yes | yes |
| no | yes | yes | yes |
| no | yes | yes | no |
| no | yes | no | no |
| no | no | yes | no |
| no | no | yes | no |
| no | no | no | no |
Table 2: Data from his past decisions

What is the entropy of the class label Go Skiing?

The entropy of the class label Go Skiing is:

\[H(D) = -(\frac{6}{11} \cdot \log_2(\frac{6}{11}) + \frac{5}{11} \cdot \log_2(\frac{5}{11})) = 0.994\]

The class label Go Skiing has probability 6/11 for yes and 5/11 for no. The resulting entropy is close to 1, indicating high uncertainty (the maximum entropy for a binary variable is 1).

Information Gain

Information gain measures how much the entropy decreases when we split the data based on a particular attribute.

Formal: for a dataset \(D\) and attribute \(X\):

\[G(D,X) = H(D) - \sum\limits_{x \in V(X)} \frac{|D_x|}{|D|} H(D_x)\]

where:

  • \(H(D)\) is the entropy of the original dataset
  • \(D_x\) is the subset of \(D\) where attribute \(X\) has value \(x\)
  • \(|D_x|\) is the size of subset \(D_x\)
  • \(H(D_x)\) is the entropy of subset \(D_x\)

The attribute with the highest information gain is the most important attribute.

Example

For Bob’s skiing data, let’s calculate the information gain for the Snow near attribute.

First, split the data based on “Snow near”:

  • \(D_{yes}\) (4 examples, all Go Skiing = yes, entropy = 0)
  • \(D_{no}\) (7 examples, 2 yes and 5 no, entropy = 0.863)

The information gain is:

\(G(D,\text{Snow near}) = 0.994 - \frac{4}{11} \cdot 0 - \frac{7}{11} \cdot 0.863 = 0.445\)

Similarly, we can calculate:

  • \(G(D,\text{Weekend}) = 0.150\)
  • \(G(D,\text{Sun}) = 0.049\)

Since “Snow near” has the highest information gain, we select it as the root node.
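
These gains can be reproduced with a short script over Table 2 (a sketch; column names are shortened):

```python
from collections import Counter
from math import log2

# Bob's past decisions from Table 2 as (snow_near, weekend, sun, go_skiing) tuples.
data = [
    ("yes", "yes", "yes", "yes"), ("yes", "yes", "yes", "yes"),
    ("yes", "yes", "no",  "yes"), ("yes", "no",  "yes", "yes"),
    ("no",  "yes", "yes", "yes"), ("no",  "yes", "yes", "yes"),
    ("no",  "yes", "yes", "no"),  ("no",  "yes", "no",  "no"),
    ("no",  "no",  "yes", "no"),  ("no",  "no",  "yes", "no"),
    ("no",  "no",  "no",  "no"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index):
    gain = entropy([r[-1] for r in rows])
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

for name, idx in [("Snow near", 0), ("Weekend", 1), ("Sun", 2)]:
    print(f"G(D, {name}) = {information_gain(data, idx):.3f}")
# G(D, Snow near) = 0.445, G(D, Weekend) = 0.150, G(D, Sun) = 0.049
```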

Building the Tree

After selecting Snow near as our root, we continue the process recursively for each subset.

For \(D_{yes}\) (all examples have Go Skiing = yes):
This is a pure node, so we’re done with this branch

For \(D_{no}\) (mixed yes and no):

  • Calculate information gain for remaining attributes on this subset
  • Weekend has higher gain (0.292) than Sun (0.169)
  • Split on Weekend
  • Continue recursively …

%%| fig-cap: "Decision tree for Bob's skiing decisions"
%%| fig-width: 8
%%| fig-height: 6

graph TD
    A["Snow near?"] -->|Yes| B["Yes"]
    A -->|No| C["Weekend?"]
    C -->|Yes| D["Sun?"]
    C -->|No| E["No"]
    D -->|Yes| F["(2 Yes, 1 No)"]
    D -->|No| G["No"]
    
    %% Apply Arial font to all elements including edges
    classDef default font-family:Arial,font-size:14px;
    linkStyle default font-family:Arial,font-size:12px,fill:none,stroke-width:1px;
    
    %% Node styling
    classDef decision fill:#f9f9f9,stroke:#333,stroke-width:1px,font-family:Arial;
    classDef yes fill:#d4f4d4,stroke:#060,stroke-width:1px,color:#060,font-family:Arial;
    classDef no fill:#f4d4d4,stroke:#600,stroke-width:1px,color:#600,font-family:Arial;
    classDef mixed fill:#f4f4d4,stroke:#660,stroke-width:1px,color:#660,font-family:Arial;
    
    class A,C,D decision;
    class B,F yes;
    class E,G no;
    class F mixed;

Gini Impurity

Instead of information gain, the Gini impurity is another common criterion:

\[\text{Gini}(X) = \sum\limits_{x \in V(X)} p_x(1-p_x)\]

Where \(V(X)\) are the class values in node \(X\) and \(p_x\) is the probability of class \(x\).

Properties:

  • Measures the probability of misclassifying a randomly chosen element
  • 0 means all elements belong to the same class
  • Maximum value occurs when classes are equally likely
  • Often used in CART (Classification and Regression Trees) algorithm
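
A minimal sketch of the formula in Python:

```python
def gini(probabilities):
    """Gini(X) = sum over x of p_x * (1 - p_x), equivalently 1 - sum of p_x^2."""
    return 1 - sum(p * p for p in probabilities)

print(gini([1.0, 0.0]))     # 0.0   -> pure node, all elements in one class
print(gini([0.5, 0.5]))     # 0.5   -> maximum impurity for two classes
print(gini([6/11, 5/11]))   # ~0.496 for the skiing label (6 yes, 5 no)
```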

Numerical Attributes

For numerical attributes:

  1. Sort the values of the attribute in the training set
  2. Consider all possible thresholds between adjacent values
  3. Calculate information gain for each threshold
  4. Select the threshold with the highest information gain
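
A sketch of this procedure (hypothetical helper names, illustrative data): midpoints between adjacent sorted values serve as candidate thresholds, and the one with the highest information gain wins.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (information gain, threshold) for the best split 'value <= t'."""
    pairs = sorted(zip(values, labels))
    base, best = entropy(labels), (-1.0, None)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2                       # candidate threshold: the midpoint
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        best = max(best, (gain, t))
    return best

# Illustrative data (not from the exercises): a numeric attribute and a binary label.
print(best_threshold([20, 30, 40, 50, 60], ["no", "no", "yes", "yes", "yes"]))
# -> approximately (0.971, 35.0): splitting at 35 separates the classes perfectly
```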

Recursive Learning Process

In each recursive step of decision tree learning, there are four cases to consider:

  1. Mixed examples (positive and negative): Choose the best attribute and split
  2. Pure node (all examples have same class): Create a leaf node with that class
  3. No examples: Create a leaf with the majority class from the parent node
  4. No attributes left but mixed classes: Create a leaf with the majority class (handles noisy data)

This process continues until all branches end in leaf nodes.
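
The following sketch mirrors these four cases. It is a simplified illustration of the greedy divide-and-conquer scheme (hypothetical function names), not a reference implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(examples, attr):
    g = entropy([y for _, y in examples])
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        g -= len(subset) / len(examples) * entropy(subset)
    return g

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def learn_tree(examples, attributes, parent_examples=()):
    if not examples:                                          # case 3: no examples
        return majority([y for _, y in parent_examples])
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                                 # case 2: pure node
        return labels[0]
    if not attributes:                                        # case 4: attributes exhausted
        return majority(labels)
    best = max(attributes, key=lambda a: gain(examples, a))   # case 1: best split
    tree = {"attribute": best, "branches": {}}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        remaining = [a for a in attributes if a != best]
        tree["branches"][value] = learn_tree(subset, remaining, examples)
    return tree

# Tiny usage example with two of Bob's observations (attribute names abbreviated).
data = [({"snow": "yes", "weekend": "yes", "sun": "yes"}, "yes"),
        ({"snow": "no",  "weekend": "no",  "sun": "yes"}, "no")]
print(learn_tree(data, ["snow", "weekend", "sun"]))
# e.g. {'attribute': 'snow', 'branches': {'yes': 'yes', 'no': 'no'}}
```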

Preventing Overfitting

The Overfitting Problem

Decision trees face a fundamental trade-off between data fit and generalization.

  • A deeper tree can fit the training data more perfectly
  • But a tree that’s too deep might capture noise rather than true patterns
  • This results in poor generalization to new data

Pruning Techniques

Pruning reduces the size of decision trees to prevent overfitting.

Pre-pruning

Stop growing the tree while it is being built:

  • Limit the maximum tree depth
  • Require a minimum number of samples per node
  • Set a minimum information gain threshold

Post-pruning

Build the full tree, then remove sections:

  • Reduced error pruning — replace nodes with their most common class if it doesn’t increase error on a validation set
  • Cost-complexity pruning — balance accuracy against tree size using a penalty parameter

Both approaches improve generalization by reducing model complexity.
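
As one possible realization, scikit-learn exposes both ideas through hyperparameters. The snippet below is an illustrative sketch (dataset and parameter values are arbitrary), not part of the lecture material.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Pre-pruning: constrain the tree while it is grown.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                             min_impurity_decrease=0.01).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity path, then pick the penalty (ccp_alpha)
# that gives the best accuracy on the validation set.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
post = max((DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
            for a in path.ccp_alphas),
           key=lambda tree: tree.score(X_val, y_val))

print("pre-pruned depth:", pre.get_depth(), "| post-pruned depth:", post.get_depth())
```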

Performance Assessment

To properly evaluate a decision tree model:

1. Divide your data into separate sets

  • Training set (e.g., 70%) — used to build the tree
  • Validation set (e.g., 10%) — used for pruning decisions
  • Test set (e.g., 20%) — used only for final evaluation

2. Measure accuracy on the test set

3. Use cross-validation² for more robust assessment

As the training set grows, prediction quality usually increases, then plateaus.
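
A sketch of this protocol with scikit-learn (dataset and split sizes are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 70 % training, 10 % validation, 20 % test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=2/3, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))  # guides pruning decisions
print("test accuracy:", model.score(X_test, y_test))      # final, untouched estimate

# Cross-validation: accuracy averaged over multiple train/test splits.
print("5-fold CV accuracy:", cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=5).mean())
```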

Usage & Extensions

Real-World Applications

Decision trees are used across many domains:

  • Finance: Credit scoring, fraud detection
  • Healthcare: Disease diagnosis, treatment planning
  • Marketing: Customer segmentation, churn prediction
  • Operations: Quality control, maintenance scheduling

Their simplicity and interpretability make them particularly valuable when decisions need to be explained to stakeholders.

Ensemble Methods

Ensemble methods address the limitations of single decision trees by combining multiple trees.

  • Random Forests: Build many trees on random subsets of data and features, then average their predictions
  • Gradient Boosting: Build trees sequentially, with each tree correcting errors made by previous trees
  • AdaBoost: Weight samples based on classification difficulty

These ensemble approaches often dramatically outperform single decision trees while retaining some interpretability.
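
For orientation, a sketch comparing the three ensemble types with scikit-learn (dataset and hyperparameters are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Gradient boosting": GradientBoostingClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}
for name, model in models.items():
    # 5-fold cross-validated accuracy for each ensemble.
    print(name, round(cross_val_score(model, X, y, cv=5).mean(), 3))
```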

Strengths and Weaknesses

| Strengths | Weaknesses |
|---|---|
| Easy to understand and interpret | Can create overly complex trees that don’t generalize well |
| Minimal data preparation required | Small changes in data can result in a very different tree |
| Handles both numerical and categorical data | Biased toward attributes with more levels |
| Handles missing values well | Struggles with diagonal decision boundaries |
| Computationally inexpensive | Generally lower accuracy than ensemble methods |

Understanding these trade-offs helps in choosing when to use decision trees.

Summary

  • Decision trees represent functions by sequencing attribute tests
  • They excel at explainability but can struggle with certain function types
  • Tree learning algorithms use information gain to select the most informative attributes
  • Pruning techniques help prevent overfitting
  • While simple, decision trees form the foundation for powerful ensemble methods

Decision trees balance performance and interpretability, making them a valuable tool in any data scientist’s toolkit.

Divide-and-conquer

Create the decision tree by applying the divide-and-conquer approach on the restaurant examples.

Compare the naive tree with the tree obtained by applying the divide-and-conquer heuristic. What differences do you see?

Decision tree

Explain how decision trees can be used to create a learning agent. Relate your answers to the components outlined in the lecture notes.

We never test the same attribute twice along one path in a decision tree. Why not?

Information Gain

Consider the following dataset about weather conditions and whether tennis matches were played:

| Day | Outlook | Temperature | Humidity | Windy | Play Tennis |
|---|---|---|---|---|---|
| 1 | Sunny | Hot | High | No | No |
| 2 | Sunny | Hot | High | Yes | No |
| 3 | Overcast | Hot | High | No | Yes |
| 4 | Rain | Mild | High | No | Yes |
| 5 | Rain | Cool | Normal | No | Yes |
| 6 | Rain | Cool | Normal | Yes | No |
| 7 | Overcast | Cool | Normal | Yes | Yes |
| 8 | Sunny | Mild | High | No | No |
| 9 | Sunny | Cool | Normal | No | Yes |
| 10 | Rain | Mild | Normal | No | Yes |
| 11 | Sunny | Mild | Normal | Yes | Yes |
| 12 | Overcast | Mild | High | Yes | Yes |
| 13 | Overcast | Hot | Normal | No | Yes |
| 14 | Rain | Mild | High | Yes | No |
  1. Calculate the entropy of the Play Tennis attribute for the entire dataset.
  2. Calculate the information gain for each of the four attributes (Outlook, Temperature, Humidity, Windy).
  3. Which attribute should be selected as the root node of the decision tree?
  4. Draw the first level of the decision tree.
  5. For the Outlook = Sunny branch, calculate which attribute should be tested next.

Hint

For the entropy calculation, count how many Yes and No instances there are in the Play Tennis column. Remember that the entropy formula is:

\[H(X) = - \sum\limits_{x \in V(X)} p_x \cdot \log_2 (p_x)\]

For information gain, you’ll need to split the data based on each attribute value and calculate the weighted entropy.

Numeric Attributes

Consider the following dataset for predicting credit risk based on income and debt levels:

| Customer ID | Income (1000€) | Debt (1000€) | Credit Risk |
|---|---|---|---|
| 1 | 45 | 10 | Low |
| 2 | 32 | 12 | Low |
| 3 | 85 | 15 | Low |
| 4 | 38 | 20 | High |
| 5 | 48 | 28 | High |
| 6 | 29 | 18 | High |
| 7 | 56 | 5 | Low |
| 8 | 22 | 10 | High |
| 9 | 70 | 8 | Low |
| 10 | 35 | 25 | High |
  1. List all potential splitting thresholds that should be considered for the Income attribute.
  2. Calculate the information gain for the threshold Income ≤ 40.
  3. List all potential splitting thresholds for the Debt attribute.
  4. Calculate the information gain for the threshold Debt ≤ 15.
  5. Which numerical split would be chosen for the root node of the decision tree?

Hint

For numerical attributes, consider threshold values that are midpoints between adjacent values in the sorted list of values. For example, if you have values 10, 15, and 20, the potential thresholds would be 12.5 and 17.5.

Decision Boundaries

Consider a two-dimensional feature space with two attributes: x₁ and x₂. The following points represent different classes:

  • Class A: (2,3), (3,2), (3,3), (4,3)
  • Class B: (1,1), (2,1), (2,2)
  • Class C: (4,1), (4,2), (5,1), (5,2)
  1. Visualize these points in a 2D coordinate system.
  2. Construct a decision tree with a maximum depth of 2 (counting the root as depth 0) that separates the classes as well as possible.
  3. Draw the decision boundaries created by your tree on the 2D plot.
  4. What is the classification accuracy of your tree on the training data?
  5. If a new point (3,1) is encountered, how would your tree classify it?

Hint

Start by identifying the best horizontal or vertical split that separates the classes. Then, apply the same process to each resulting region. The decision boundaries will form axis-parallel lines in the 2D space.

Pruning

Consider the following decision tree trained on a small dataset:

graph TD
    A[Age ≤ 30?] -->|Yes| B[Income ≤ 40K?]
    A -->|No| C[Income ≤ 60K?]
    
    B -->|Yes| D[Buy = No<br>3/4 correct]
    B -->|No| E[Education?]
    
    C -->|Yes| F[Education?]
    C -->|No| G[Buy = Yes<br>5/6 correct]
    
    E -->|High| H[Buy = Yes<br>2/2 correct]
    E -->|Low| I[Buy = No<br>1/1 correct]
    
    F -->|High| J[Buy = Yes<br>3/4 correct]
    F -->|Low| K[Buy = No<br>2/3 correct]

Unpruned decision tree

The tree makes some errors on the training data, as indicated by the fractions (e.g., “3/4 correct” means the node correctly classifies 3 out of 4 training examples that reach that node).

You also have a validation set with the following distribution:

| Node | Validation examples | Correctly classified |
|---|---|---|
| D | 8 | 5 |
| H | 3 | 2 |
| I | 2 | 1 |
| J | 6 | 3 |
| K | 5 | 3 |
| G | 10 | 7 |
  1. Calculate the classification accuracy of the unpruned tree on the validation set.
  2. Apply reduced error pruning to the tree. Identify which nodes, if any, should be pruned.
  3. Draw the resulting pruned tree.
  4. Calculate the classification accuracy of the pruned tree on the validation set.
  5. Explain why pruning improved or worsened the performance.

Hint

In reduced error pruning, replace a node with its most common class if doing so doesn’t decrease accuracy on the validation set. Start with the leaf nodes and work your way up. For each non-leaf node, calculate the accuracy before and after replacing it with a leaf.

Gini Index vs. Entropy

Consider the following dataset for customer churn prediction:

| Customer | Age | Usage | Contract Length | Churn |
|---|---|---|---|---|
| 1 | Young | Low | Month-to-month | Yes |
| 2 | Young | Medium | Year | No |
| 3 | Young | High | Month-to-month | No |
| 4 | Middle | Low | Month-to-month | Yes |
| 5 | Middle | Medium | Year | Yes |
| 6 | Middle | High | Month-to-month | No |
| 7 | Senior | Low | Month-to-month | Yes |
| 8 | Senior | Medium | Year | No |
| 9 | Senior | High | Year | No |
| 10 | Senior | High | Month-to-month | Yes |
  1. Calculate the Gini index for the entire dataset with respect to the Churn attribute.
  2. For the Contract Length attribute, calculate:
    1. The Gini index for each split
    2. The weighted Gini index after splitting
    3. The Gini gain (reduction in impurity)
  3. For the same Contract Length attribute, calculate:
    1. The entropy for each split
    2. The weighted entropy after splitting
    3. The information gain
  4. Which measure (Gini or entropy) indicates Contract Length is more useful for splitting?
  5. Explain a scenario where Gini index and entropy might lead to different tree structures.

Hint

The Gini index for a node with class distribution \((p_1, p_2, ..., p_k)\) is calculated as:

\[Gini(D) = 1 - \sum_{i=1}^{k} p_i^2\]

For a binary classification like “Churn” (Yes/No), this simplifies to:

\[Gini(D) = 1 - (p_{yes}^2 + p_{no}^2)\]

The Gini gain is similar to information gain: it’s the difference between the parent node’s Gini index and the weighted average of the children nodes’ Gini indices.

Literature

Russell, Stuart, and Peter Norvig. 2022. Artificial Intelligence: A Modern Approach. Harlow: Pearson Education.

Footnotes

  1. \(\log_2 (|V(X)|)\) describes the upper bound of entropy — the maximum possible entropy for a random variable \(X\) is \(\log_2 (|V(X)|)\)

  2. Multiple train-test splits are created and averaged.