Introduction to AI (I2AI)
Neu-Ulm University of Applied Sciences
May 9, 2025
Decision trees are a fundamental concept in knowledge-based AI agents, as they represent a learned function in an explicit, human-readable form.
They constitute a critical part of the AI toolkit by offering both predictive power and interpretability.
A decision tree is a representation of a function that maps a vector of attribute values to a single output value (i.e., a “decision”).
In Boolean decision trees, the input is a vector of attribute values \(x\) and the output is a single Boolean value \(y\).
Consider deciding whether to wait for a table at a restaurant (Russell and Norvig 2022, 668).
The output \(y\) is a Boolean variable \(WillWait\).
The input \(x\) is a vector of ten attributes with discrete values: Alternate, Bar, Fri/Sat, Hungry, Patrons, Price, Raining, Reservation, Type, and WaitEstimate (abbreviated in the table below).
Example decision tree
Underlying training set
Example | Alt | Bar | Fri | Hun | Pat | Price | Rain | Res | Type | Est | WillWait |
---|---|---|---|---|---|---|---|---|---|---|---|
\(x_1\) | Yes | No | No | Yes | Some | €€€ | No | Yes | French | 0-10 | \(y_1 =\) Yes |
\(x_2\) | Yes | No | No | Yes | Full | € | No | No | Thai | 30-60 | \(y_2 =\) No |
\(x_3\) | No | Yes | No | No | Some | € | No | No | Burger | 0-10 | \(y_3 =\) Yes |
\(x_4\) | Yes | No | Yes | Yes | Full | € | Yes | No | Thai | 10-30 | \(y_4 =\) Yes |
\(x_5\) | Yes | No | Yes | No | Full | €€€ | No | Yes | French | >60 | \(y_5 =\) No |
\(x_6\) | No | Yes | No | Yes | Some | €€ | Yes | Yes | Italian | 0-10 | \(y_6 =\) Yes |
\(x_7\) | No | Yes | No | No | None | € | Yes | No | Burger | 0-10 | \(y_7 =\) No |
\(x_8\) | No | No | No | Yes | Some | €€ | Yes | Yes | Thai | 0-10 | \(y_8 =\) Yes |
\(x_9\) | No | Yes | Yes | No | Full | € | Yes | No | Burger | >60 | \(y_9 =\) No |
\(x_{10}\) | Yes | Yes | Yes | Yes | Full | €€€ | No | Yes | Italian | 10-30 | \(y_{10}=\) No |
\(x_{11}\) | No | No | No | No | None | € | No | No | Thai | 0-10 | \(y_{11} =\) No |
\(x_{12}\) | Yes | Yes | Yes | Yes | Full | € | No | No | Burger | 30-60 | \(y_{12} =\) Yes |
Decision trees can be structured in two ways:
Binary Decision Trees
Each internal node has exactly two branches (e.g., True/False)
Non-Binary Decision Trees
A node testing a categorical attribute can have one branch per attribute value
Given the following 2D data with three classes (red, green, blue), construct a decision tree with depth 3 that separates the classes as effectively as possible.
Solution to the decision tree exercise
```{mermaid}
graph TD
  A[Root Node<br>x₁ ≤ 50?] -->|Yes| B[x₂ ≤ 50?]
  A -->|No| C[Class: Blue]
  B -->|Yes| D[Class: Red]
  B -->|No| E[Class: Green]

  classDef default font-family:Arial,font-size:14px;
  classDef redClass fill:#ffcccc,stroke:#ff0000,color:#cc0000,font-family:Arial;
  classDef greenClass fill:#ccffcc,stroke:#00cc00,color:#006600,font-family:Arial;
  classDef blueClass fill:#ccccff,stroke:#0000ff,color:#0000cc,font-family:Arial;
  class D redClass;
  class E greenClass;
  class C blueClass;
```
To get a naive solution, we could simply construct a tree with one path to a leaf for each example.
How can we find a tree that is consistent with the training examples and as small as possible?
It’s intractable to find the smallest consistent tree, but decision tree learning algorithms use greedy heuristics to efficiently find a reasonably small tree.
Decision tree learning adopts a greedy divide-and-conquer approach: test the most important attribute first, then recursively solve the smaller subproblems created by that test.
The goal is to reach the correct classification with a small number of tests, creating a tree that is both accurate and shallow.
But how do we determine which attribute is most important?
Entropy comes from information theory and measures the unpredictability of a random variable. In the context of decision trees, we’re interested in the entropy of the class label. When entropy is high, the classes are mixed. When entropy is zero, we have a pure node with only one class present.
Formally: for a random variable \(X\) with possible values \(V(X)\):
\[H(X) = - \sum\limits_{x \in V(X)} p_x \cdot \log_2 (p_x)\]
where \(p_x\) is the probability that \(X\) has value \(x\).
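The formula translates directly into a few lines of Python; the following is a minimal sketch (the function name `entropy` and the use of `collections.Counter` are our own choices):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes", "no", "no", "no"]))   # ≈ 0.811 (mixed classes)
print(entropy(["yes", "yes", "no", "no"]))  # 1.0 (maximally mixed binary node)
```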
Bob is deciding whether to go skiing based on three factors:
snow nearby, weekend, and sunny day.
Snow near | Weekend | Sun | Go Skiing |
---|---|---|---|
yes | yes | yes | yes |
yes | yes | yes | yes |
yes | yes | no | yes |
yes | no | yes | yes |
no | yes | yes | yes |
no | yes | yes | yes |
no | yes | yes | no |
no | yes | no | no |
no | no | yes | no |
no | no | yes | no |
no | no | no | no |
What is the entropy of the class label Go Skiing?
The entropy of the class label Go Skiing is:
\[H(D) = -(\frac{6}{11} \cdot \log_2(\frac{6}{11}) + \frac{5}{11} \cdot \log_2(\frac{5}{11})) = 0.994\]
The class label Go Skiing has probability 6/11 for yes and 5/11 for no. The resulting entropy is close to 1, indicating high uncertainty (the maximum entropy for a binary variable is 1[^1]).
Information gain measures how much the entropy decreases when we split the data based on a particular attribute.
Formally: for a dataset \(D\) and an attribute \(X\):
\[G(D,X) = H(D) - \sum\limits_{x \in V(X)} \frac{|D_x|}{|D|} H(D_x)\]
where \(V(X)\) is the set of values of \(X\), \(D_x\) is the subset of \(D\) in which \(X\) has value \(x\), and \(H(D_x)\) is its entropy.
The attribute with the highest information gain is the most important attribute.
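Reusing the `entropy` helper sketched above, information gain can be computed as follows (representing a dataset as a list of dictionaries is our own simplification):

```python
def information_gain(rows, attribute, target):
    """G(D, X): entropy of the target minus the weighted entropy after splitting on `attribute`."""
    total = entropy([row[target] for row in rows])
    n = len(rows)
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return total - remainder
```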
For Bob’s skiing data, let’s calculate the information gain for the Snow near attribute.
First, split the data based on “Snow near”: \(D_{yes}\) contains 4 examples, all with Go Skiing = yes (\(H(D_{yes}) = 0\)); \(D_{no}\) contains 7 examples with 2 yes and 5 no (\(H(D_{no}) = 0.863\)).
The information gain is:
\(G(D,\text{Snow near}) = 0.994 - \frac{4}{11} \cdot 0 - \frac{7}{11} \cdot 0.863 = 0.445\)
Similarly, we can calculate \(G(D,\text{Weekend}) \approx 0.150\) and \(G(D,\text{Sun}) \approx 0.049\).
Since “Snow near” has the highest information gain, we select it as the root node.
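As a sanity check, running the two helpers sketched above on Bob's table reproduces these numbers (the column names are our own shorthand):

```python
ski = [
    {"snow": "yes", "weekend": "yes", "sun": "yes", "ski": "yes"},
    {"snow": "yes", "weekend": "yes", "sun": "yes", "ski": "yes"},
    {"snow": "yes", "weekend": "yes", "sun": "no",  "ski": "yes"},
    {"snow": "yes", "weekend": "no",  "sun": "yes", "ski": "yes"},
    {"snow": "no",  "weekend": "yes", "sun": "yes", "ski": "yes"},
    {"snow": "no",  "weekend": "yes", "sun": "yes", "ski": "yes"},
    {"snow": "no",  "weekend": "yes", "sun": "yes", "ski": "no"},
    {"snow": "no",  "weekend": "yes", "sun": "no",  "ski": "no"},
    {"snow": "no",  "weekend": "no",  "sun": "yes", "ski": "no"},
    {"snow": "no",  "weekend": "no",  "sun": "yes", "ski": "no"},
    {"snow": "no",  "weekend": "no",  "sun": "no",  "ski": "no"},
]

for attr in ["snow", "weekend", "sun"]:
    print(attr, round(information_gain(ski, attr, "ski"), 3))
# snow ≈ 0.445, weekend ≈ 0.150, sun ≈ 0.049 -> "Snow near" is chosen as the root
```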
After selecting Snow near as our root, we continue the process recursively for each subset.
For \(D_{yes}\) (all examples have Go Skiing = yes):
This is a pure node, so we’re done with this branch
For \(D_{no}\) (mixed yes and no): we repeat the procedure on this subset. Weekend now yields the highest information gain, followed by Sun, which gives the tree below.
```{mermaid}
%%| fig-cap: "Decision tree for Bob's skiing decisions"
%%| fig-width: 8
%%| fig-height: 6
graph TD
  A["Snow near?"] -->|Yes| B["Yes"]
  A -->|No| C["Weekend?"]
  C -->|Yes| D["Sun?"]
  C -->|No| E["No"]
  D -->|Yes| F["(2 Yes, 1 No)"]
  D -->|No| G["No"]

  %% Apply Arial font to all elements including edges
  classDef default font-family:Arial,font-size:14px;
  linkStyle default font-family:Arial,font-size:12px,fill:none,stroke-width:1px;

  %% Node styling
  classDef decision fill:#f9f9f9,stroke:#333,stroke-width:1px,font-family:Arial;
  classDef yes fill:#d4f4d4,stroke:#060,stroke-width:1px,color:#060,font-family:Arial;
  classDef no fill:#f4d4d4,stroke:#600,stroke-width:1px,color:#600,font-family:Arial;
  classDef mixed fill:#f4f4d4,stroke:#660,stroke-width:1px,color:#660,font-family:Arial;

  class A,C,D decision;
  class B,F yes;
  class E,G no;
  class F mixed;
```
Instead of information gain, the Gini impurity is another common criterion:
\[\text{Gini}(X) = \sum\limits_{x \in V(X)} p_x(1-p_x)\]
Where \(V(X)\) are the class values in node \(X\) and \(p_x\) is the probability of class \(x\).
Properties: the Gini impurity is 0 for a pure node, maximal when the classes are evenly distributed, and slightly cheaper to compute than entropy because no logarithm is required.
For numerical attributes: candidate split thresholds are the midpoints between adjacent values in the sorted list of observed values, and the node tests a condition of the form \(X \le t\).
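A minimal sketch of both ideas in Python (the helper names `gini` and `candidate_thresholds` are our own):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: sum of p * (1 - p) over the classes, i.e. 1 - sum of p^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def candidate_thresholds(values):
    """Midpoints between adjacent distinct values of a numerical attribute."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

print(gini(["yes", "yes", "no", "no"]))    # 0.5 (evenly mixed)
print(gini(["yes", "yes", "yes"]))         # 0.0 (pure node)
print(candidate_thresholds([10, 15, 20]))  # [12.5, 17.5]
```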
In each recursive step of decision tree learning, there are four cases to consider:
1. If the remaining examples contain both classes, choose the attribute with the highest information gain and split on it.
2. If all remaining examples belong to one class, return a leaf node with that class.
3. If no examples remain for a branch, return a leaf node with the plurality class of the parent’s examples.
4. If no attributes are left but the examples still contain both classes (noise or insufficient attributes), return a leaf node with the plurality class of the remaining examples.
This process continues until all branches end in leaf nodes.
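Putting the pieces together, the recursion can be sketched roughly as follows, reusing the `information_gain` helper from above (the nested-dictionary tree representation is our own simplification of the scheme in Russell and Norvig 2022):

```python
from collections import Counter

def plurality(rows, target):
    """Most common value of the target attribute among the given examples."""
    return Counter(row[target] for row in rows).most_common(1)[0][0]

def learn_tree(rows, attributes, target, parent_rows=None):
    if not rows:                        # no examples left for this branch
        return plurality(parent_rows, target)
    classes = {row[target] for row in rows}
    if len(classes) == 1:               # pure node: all examples share one class
        return classes.pop()
    if not attributes:                  # no attributes left: fall back to plurality
        return plurality(rows, target)
    # mixed classes: split on the most informative attribute
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    # note: this sketch only branches on values actually seen in `rows`; the
    # "no examples left" case applies when branching on all possible values
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = learn_tree(subset, remaining, target, rows)
    return tree
```

Applied to Bob's table (`learn_tree(ski, ["snow", "weekend", "sun"], "ski")`), this reproduces the tree shown above, with the mixed (2 Yes, 1 No) leaf resolved by plurality.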
Decision trees face a fundamental trade-off between data fit and generalization.
Pruning reduces the size of decision trees to prevent overfitting.
Pre-pruning
Stop growing the tree during construction, e.g., by limiting the maximum depth or requiring a minimum number of examples per split.
Post-pruning
Build the full tree, then remove subtrees that do not improve performance on a validation set
Both approaches improve generalization by reducing model complexity.
To properly evaluate a decision tree model:
1. Divide your data into separate training and test sets
2. Measure accuracy on the test set
3. Use cross-validation[^2] for more robust assessment
As the training set grows, prediction quality usually increases, then plateaus.
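With a library such as scikit-learn, this evaluation workflow looks roughly as follows (shown on the bundled Iris data purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 1. Hold out a separate test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 2. Fit on the training set, measure accuracy on the test set
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))

# 3. Cross-validation: several train-test splits, averaged
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5)
print("5-fold CV accuracy:", scores.mean())
```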
Decision trees are used across many domains, for example credit risk assessment, medical diagnosis, and customer churn prediction.
Their simplicity and interpretability make them particularly valuable when decisions need to be explained to stakeholders.
Ensemble methods such as random forests (bagging) and gradient-boosted trees address the limitations of single decision trees by combining many trees.
These ensemble approaches often dramatically outperform single decision trees while retaining some interpretability.
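In scikit-learn, for example, a random forest can be swapped in for a single tree with one line; the sketch below compares the two by cross-validation (no specific accuracy figures are implied):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("single tree  :", cross_val_score(single_tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```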
Strengths | Weaknesses |
---|---|
Easy to understand and interpret | Can create overly complex trees that don’t generalize well |
Minimal data preparation required | Small changes in data can result in a very different tree |
Handles both numerical and categorical data | Biased toward attributes with more levels |
Handles missing values well | Struggles with diagonal decision boundaries |
Computationally inexpensive | Generally lower accuracy than ensemble methods |
Understanding these trade-offs helps in choosing when to use decision trees.
Decision trees balance performance and interpretability, making them a valuable tool in any data scientist’s toolkit.
Create the decision tree by applying the divide-and-conquer approach on the restaurant examples.
Compare the naive tree with the tree gained by applying the divide-and-conquer heuristic. What differences do you see?
Explain how decision trees can be used to create a learning agent. Relate your answers to the components outlined in the lecture notes.
We never test the same attribute twice along one path in a decision tree. Why not?
Consider the following dataset about weather conditions and whether tennis matches were played:
Day | Outlook | Temperature | Humidity | Windy | Play Tennis |
---|---|---|---|---|---|
1 | Sunny | Hot | High | No | No |
2 | Sunny | Hot | High | Yes | No |
3 | Overcast | Hot | High | No | Yes |
4 | Rain | Mild | High | No | Yes |
5 | Rain | Cool | Normal | No | Yes |
6 | Rain | Cool | Normal | Yes | No |
7 | Overcast | Cool | Normal | Yes | Yes |
8 | Sunny | Mild | High | No | No |
9 | Sunny | Cool | Normal | No | Yes |
10 | Rain | Mild | Normal | No | Yes |
11 | Sunny | Mild | Normal | Yes | Yes |
12 | Overcast | Mild | High | Yes | Yes |
13 | Overcast | Hot | Normal | No | Yes |
14 | Rain | Mild | High | Yes | No |
Hint
For the entropy calculation, count how many Yes and No instances there are in the Play Tennis column. Remember that the entropy formula is:
\[H(X) = - \sum\limits_{x \in V(X)} p_x \cdot \log_2 (p_x)\]
For information gain, you’ll need to split the data based on each attribute value and calculate the weighted entropy.
Consider the following dataset for predicting credit risk based on income and debt levels:
Customer ID | Income (1000€) | Debt (1000€) | Credit Risk |
---|---|---|---|
1 | 45 | 10 | Low |
2 | 32 | 12 | Low |
3 | 85 | 15 | Low |
4 | 38 | 20 | High |
5 | 48 | 28 | High |
6 | 29 | 18 | High |
7 | 56 | 5 | Low |
8 | 22 | 10 | High |
9 | 70 | 8 | Low |
10 | 35 | 25 | High |
Hint
For numerical attributes, consider threshold values that are midpoints between adjacent values in the sorted list of values. For example, if you have values 10, 15, and 20, the potential thresholds would be 12.5 and 17.5.
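If you want to check your candidate splits programmatically, the search over midpoints can be sketched like this, reusing the `entropy` helper defined earlier in these notes (the function name `best_threshold` is our own):

```python
def best_threshold(rows, attribute, target):
    """Try every midpoint between adjacent attribute values and return the
    (threshold, weighted_entropy) pair with the lowest weighted entropy."""
    values = sorted({row[attribute] for row in rows})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    n = len(rows)
    best = None
    for t in candidates:
        left = [row[target] for row in rows if row[attribute] <= t]
        right = [row[target] for row in rows if row[attribute] > t]
        weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or weighted < best[1]:
            best = (t, weighted)
    return best
```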
Consider a two-dimensional feature space with two attributes: x₁ and x₂. The following points represent different classes:
Hint
Start by identifying the best horizontal or vertical split that separates the classes. Then, apply the same process to each resulting region. The decision boundaries will form axis-parallel lines in the 2D space.
Consider the following decision tree trained on a small dataset:
```{mermaid}
graph TD
  A[Age ≤ 30?] -->|Yes| B[Income ≤ 40K?]
  A -->|No| C[Income ≤ 60K?]
  B -->|Yes| D[Buy = No<br>3/4 correct]
  B -->|No| E[Education?]
  C -->|Yes| F[Education?]
  C -->|No| G[Buy = Yes<br>5/6 correct]
  E -->|High| H[Buy = Yes<br>2/2 correct]
  E -->|Low| I[Buy = No<br>1/1 correct]
  F -->|High| J[Buy = Yes<br>3/4 correct]
  F -->|Low| K[Buy = No<br>2/3 correct]
```
The tree makes some errors on the training data, as indicated by the fractions (e.g., “3/4 correct” means the node correctly classifies 3 out of 4 training examples that reach that node).
You also have a validation set with the following distribution:
Node | Validation examples | Correctly classified |
---|---|---|
D | 8 | 5 |
H | 3 | 2 |
I | 2 | 1 |
J | 6 | 3 |
K | 5 | 3 |
G | 10 | 7 |
Hint
In reduced error pruning, replace a node with its most common class if doing so doesn’t decrease accuracy on the validation set. Start with the leaf nodes and work your way up. For each non-leaf node, calculate the accuracy before and after replacing it with a leaf.
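The comparison itself is only a ratio test; here is a minimal sketch with made-up counts (not the numbers from this exercise):

```python
def subtree_accuracy(leaf_counts):
    """Validation accuracy of a subtree, given (correct, total) pairs for its leaves."""
    correct = sum(c for c, _ in leaf_counts)
    total = sum(t for _, t in leaf_counts)
    return correct / total

def should_prune(leaf_counts, correct_as_leaf):
    """Prune if a single majority-class leaf does at least as well on the same validation examples."""
    total = sum(t for _, t in leaf_counts)
    return correct_as_leaf / total >= subtree_accuracy(leaf_counts)

# Hypothetical subtree whose two leaves classify 4/6 and 3/5 validation examples
# correctly (7/11 overall); a replacement leaf that gets 8 of those 11 right
# does not hurt accuracy, so the subtree is pruned.
print(should_prune([(4, 6), (3, 5)], correct_as_leaf=8))  # True
```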
Consider the following dataset for customer churn prediction:
Customer | Age | Usage | Contract Length | Churn |
---|---|---|---|---|
1 | Young | Low | Month-to-month | Yes |
2 | Young | Medium | Year | No |
3 | Young | High | Month-to-month | No |
4 | Middle | Low | Month-to-month | Yes |
5 | Middle | Medium | Year | Yes |
6 | Middle | High | Month-to-month | No |
7 | Senior | Low | Month-to-month | Yes |
8 | Senior | Medium | Year | No |
9 | Senior | High | Year | No |
10 | Senior | High | Month-to-month | Yes |
Hint
The Gini index for a node with class distribution \((p_1, p_2, ..., p_k)\) is calculated as:
\[Gini(D) = 1 - \sum_{i=1}^{k} p_i^2\]
For a binary classification like “Churn” (Yes/No), this simplifies to:
\[Gini(D) = 1 - (p_{yes}^2 + p_{no}^2)\]
The Gini gain is similar to information gain: it’s the difference between the parent node’s Gini index and the weighted average of the children nodes’ Gini indices.
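For instance, with made-up numbers (not taken from the churn data): a parent node with 5 positive and 5 negative examples split into children with class counts (3, 1) and (2, 4) gives

\[
Gini(D) = 1 - (0.5^2 + 0.5^2) = 0.5, \qquad
Gini(D_1) = 1 - (0.75^2 + 0.25^2) = 0.375, \qquad
Gini(D_2) = 1 - \left(\left(\tfrac{1}{3}\right)^2 + \left(\tfrac{2}{3}\right)^2\right) \approx 0.444
\]

\[
\text{Gini gain} = 0.5 - \left(\tfrac{4}{10} \cdot 0.375 + \tfrac{6}{10} \cdot 0.444\right) \approx 0.083
\]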
[^1]: The maximum possible entropy of a random variable \(X\) is \(\log_2 (|V(X)|)\).
[^2]: Multiple train-test splits are created and averaged.