Introduction to AI (I2AI)
Definition, structure, advantages — and why decision trees remain a critical part of the AI toolkit despite decades of newer approaches.
What is a decision tree?
Decision trees are fundamental concepts in knowledge-based AI agents. They offer predictive power and interpretability
A decision tree is a representation of a function that maps a vector of attribute values to a single output value (a "decision"):
In Boolean decision trees, the input is a vector of input attributes X and a single Boolean output value y. Decision trees can handle both classification (discrete output) and regression (continuous output) tasks.
Should I wait for a table? Input: 10 attributes (Alt, Bar, Fri, Hungry, Patrons, Price, WaitEstimate, Reservation, Raining, Type). Output: WillWait (true / false)
Figure 1: Decision tree restaurant example based on Russell & Norvig (2022, p. 674)
A bank wants to decide whether to approve a loan application. The following attributes are available:
| Attribute | Values |
|---|---|
| Credit Score | Good / Bad |
| Income | High / Low |
| Employment | Employed / Unemployed |
| Existing Debt | Yes / No |
Output: Approve (Yes / No)
One intuitive decision tree based on domain knowledge, not training data. Credic Score is the dominant factor in lending, so it makes sense as the root:
Figure 2: Decision tree loan example.
Note that this is just one possible solution, your tree might look very different. A real trained tree might look quite different — it would calculate information gain and discover which attribute actually separates approvals from denials most cleanly in the training data. The structure might surprise you.Node count: 11
Can every possible combination of attribute values be correctly classified by the tree?
Yes — can you shorten this a bit? every possible combination of the 4 binary/categorical attributes leads to exactly one leaf. With 4 attributes that have 2 values each, there are 2⁴ = 16 possible input combinations (Credit Score has 2 values, Income has 2, Employment has 2, Existing Debt has 2). The tree as drawn doesn't test all four attributes on every path — for example, a Bad credit score + Unemployed path goes directly to Deny without checking Income or Existing Debt. This is intentional: once we're certain enough, we don't need more tests. This is exactly the principle that makes decision trees efficient, not every path needs to reach maximum depth.
Each node has exactly two branches (True / False)
Maps directly to if-then-else statements in code — computationally efficient
Categorical nodes can have multiple branches
More intuitive when a feature has many categories (e.g., Country, Type)
What did we draw in Exercise 1? Binary or non-binary — and why? Where might we need a non-binary tree instead of a binary one?
From naive memorisation to generalisation — divide-and-conquer, entropy, information gain, Gini impurity, and numerical attributes.
The naive approach: construct a tree with one path to a leaf for each example — testing all attributes along the path and attaching the classification to the leaf. What is wrong with this?
Finding the smallest consistent tree is intractable — but decision tree learning algorithms use greedy heuristics to efficiently find a reasonably small tree.
Decision tree learning adopts a greedy divide-and-conquer approach:
Reach the correct classification with a small number of tests, creating a tree that is both accurate and shallow.
But how do we determine which attribute is most important? → This is where entropy and information gain come in.
Entropy comes from information theory and measures the unpredictability of a random variable. When entropy is high, the classes are mixed. When entropy is zero, we have a pure node.
Formal: for a random variable X with possible values V(X):
H(X) = − Σ p(x) · log₂(p(x))
0 ≤ H(X) ≤ log₂(|V(X)|)Bob decides whether to go skiing based on: Snow nearby, Weekend, Sunny.
| Snow | Weekend | Sun | Go Skiing |
|---|---|---|---|
| yes | yes | yes | yes |
| yes | yes | yes | yes |
| yes | yes | no | yes |
| yes | no | yes | yes |
| no | yes | yes | yes |
| no | yes | yes | yes |
| no | yes | yes | no |
| no | yes | no | no |
| no | no | yes | no |
| no | no | yes | no |
| no | no | no | no |
6 yes, 5 no out of 11 examples:
H(D) = −(6/11 · log₂(6/11) + 5/11 · log₂(5/11)) = 0.994
Close to 1 → high uncertainty. The maximum entropy for a binary variable is 1, indicating we're almost at maximum unpredictability.
Information gain measures how much the entropy decreases when we split the data based on a particular attribute.
Formal: for a dataset D and attribute X:
G(D, X) = H(D) − Σ (|D_x| / |D|) · H(D_x)
H(D) — entropy of the original datasetD_x — subset of D where attribute X has value x|D_x| — size of subset D_xH(D_x) — entropy of subset D_xThe attribute with the highest information gain is the most important attribute — select it as the split node.
Bob decides whether to go skiing based on: Snow nearby, Weekend, Sunny.
| Snow | Weekend | Sun | Go Skiing |
|---|---|---|---|
| yes | yes | yes | yes |
| yes | yes | yes | yes |
| yes | yes | no | yes |
| yes | no | yes | yes |
| no | yes | yes | yes |
| no | yes | yes | yes |
| no | yes | yes | no |
| no | yes | no | no |
| no | no | yes | no |
| no | no | yes | no |
| no | no | no | no |
Calculating information gain for the Snow nearby attribute:
G(D, Snow) = 0.994 − (4/11 · 0) − (7/11 · 0.863) = 0.445
Similarly:
G(D, Weekend) = 0.150G(D, Sun) = 0.049Since Snow nearby has the highest information gain (0.445), we select it as the root node.
After selecting Snow nearby as root, continue the process recursively for each subset:
Figure 3: Decision tree Example.
| # | Strength Level | Ejoys Suffering | Social Butterfly | Event |
|---|---|---|---|---|
| 1 | Low | Yes | No | Marathon |
| 2 | Medium | Yes | Yes | Hyrox |
| 3 | Low | No | No | Rest |
| 4 | High | Yes | No | Marathon |
| 5 | Medium | No | Yes | Rest |
| 6 | Low | No | No | Rest |
| 7 | Low | No | No | Rest |
| 8 | Medium | Yes | Yes | Hyrox |
| 9 | High | Yes | Yes | Hyrox |
| 10 | Low | Yes | No | Marathon |
H(D) = − Σ p(x) · log₂(p(x))
Marathon: 3/10 · Hyrox: 3/10 · Rest: 4/10
H(D) = −(3/10·log₂(3/10) + 3/10·log₂(3/10) + 4/10·log₂(4/10))
= −(3/10·(−1.737) + 3/10·(−1.737) + 4/10·(−1.322))
= −(−0.521 − 0.521 − 0.529) = 1.571
Maximum for 3 classes = log₂(3) = 1.585 — we are very close to maximum uncertainty. The dataset is highly mixed.
H(D) = − Σ p(x) · log₂(p(x))
G(D, X) = H(D) − Σ (|D_x| / |D|) · H(D_x)
Split D by Enjoys Suffering (Yes / No):
| Subset | # | Ma | Hy | Re |
|---|---|---|---|---|
| Yes | 6 | 3 | 3 | 0 |
| No | 4 | 0 | 0 | 4 |
H(DYes) = −(3/6·log₂(3/6) + 3/6·log₂(3/6))
= −(3/6·(−1.000) + 3/6·(−1.000))
= −(−0.500 − 0.500) = 1.000
H(DNo) = −(4/4·log₂(4/4)) = −(1·0) = 0.000 (pure — all Rest)
G(D, Suffering) = 1.571 − (6/10)·1.000 − (4/10)·0.000
= 1.571 − 0.600 − 0.000 = 0.971
Split D by Strength Level (High / Medium / Low):
| Subset | # | Ma | Hy | Re |
|---|---|---|---|---|
| High | 2 | 1 | 1 | 0 |
| Medium | 3 | 0 | 2 | 1 |
| Low | 5 | 2 | 0 | 3 |
H(DHigh) = −(1/2·log₂(1/2) + 1/2·log₂(1/2))
= −(1/2·(−1.000) + 1/2·(−1.000))
= −(−0.500 − 0.500) = 1.000
H(DMedium) = −(2/3·log₂(2/3) + 1/3·log₂(1/3))
= −(2/3·(−0.585) + 1/3·(−1.585))
= −(−0.390 − 0.528) = 0.918
H(DLow) = −(2/5·log₂(2/5) + 3/5·log₂(3/5))
= −(2/5·(−1.322) + 3/5·(−0.737))
= −(−0.529 − 0.442) = 0.971
G(D, Strength) = 1.571 − (2/10)·1.000 − (3/10)·0.918 − (5/10)·0.971
= 1.571 − 0.200 − 0.275 − 0.486 = 0.610
Split D by Social Butterfly (Yes / No):
| Subset | # | Ma | Hy | Re |
|---|---|---|---|---|
| Yes | 4 | 0 | 3 | 1 |
| No | 6 | 3 | 0 | 3 |
H(DYes) = −(3/4·log₂(3/4) + 1/4·log₂(1/4))
= −(3/4·(−0.415) + 1/4·(−2.000))
= −(−0.311 − 0.500) = 0.811
H(DNo) = −(3/6·log₂(3/6) + 3/6·log₂(3/6))
= −(3/6·(−1.000) + 3/6·(−1.000))
= −(−0.500 − 0.500) = 1.000
G(D, Social) = 1.571 − (4/10)·0.811 − (6/10)·1.000
= 1.571 − 0.324 − 0.600 = 0.647
| Attribute | G(D, X) |
|---|---|
| Enjoys Suffering | 0.971 → highest → root |
| Social Butterfly | 0.647 |
| Strength Level | 0.610 |
Enjoys Suffering has the highest information gain (0.971) → it is the root node.
Splitting immediately produces one pure leaf: Suffering = No → all 4 athletes chose Rest.
DYes has 6 examples: Marathon×3, Hyrox×3, Rest×0. H(DYes) = 1.000. Calculate G for the remaining 2 attributes.
G(DYes, Social):
Social=Yes (3): Ma=0, Hy=3 → H = 0.000 (pure)
Social=No (3): Ma=3, Hy=0 → H = 0.000 (pure)
= 1.000 − (3/6)·0.000 − (3/6)·0.000
= 1.000 − 0 − 0 = 1.000
G(DYes, Strength):
High (2): Ma=1, Hy=1 → H = 1.000
Medium (2): Ma=0, Hy=2 → H = 0.000 (pure)
Low (2): Ma=2, Hy=0 → H = 0.000 (pure)
= 1.000 − (2/6)·1.000 − 0 − 0
= 1.000 − 0.333 = 0.667
Social Butterfly has the highest G (1.000) → perfect split.
Social=Yes → pure: Hyrox (3/3) · Social=No → pure: Marathon (3/3)
Instead of information gain, Gini impurity is another common splitting criterion — it measures the probability of incorrectly classifying a randomly chosen element if it were randomly labeled according to the class distribution:
Gini(X) = Σ p(x) · (1 − p(x))
Gini(D) = 1 − Σ p(x)²
Measures the probability of incorrectly classifying a randomly chosen element. A pure node has Gini = 0. Instead of information gain, we now look for the attribute with the lowest weighted Gini impurity after splitting.
Gini(Dx) = 1 − Σ p(x)²Σ (|Dx| / |D|) · Gini(Dx)Gini(D) = 1 − Σ p(x)²
Marathon: 3/10 · Hyrox: 3/10 · Rest: 4/10
Gini(D) = 1 − (3/10)² − (3/10)² − (4/10)²
= 1 − 0.090 − 0.090 − 0.160 = 0.660
Maximum for 3 equal classes = 1 − 3·(1/3)² = 0.667 — we are very close to maximum impurity. The dataset is highly mixed.
Split D by Enjoys Suffering (Yes / No):
| Subset | n | Ma | Hy | Re |
|---|---|---|---|---|
| Yes | 6 | 3 | 3 | 0 |
| No | 4 | 0 | 0 | 4 |
Gini(DYes) = 1 − (3/6)² − (3/6)²
= 1 − 0.250 − 0.250 = 0.500
Gini(DNo) = 1 − (4/4)²
= 1 − 1.000 = 0.000 (pure — all Rest)
Weighted Gini(Suffering) = (6/10)·0.500 + (4/10)·0.000
= 0.300 + 0.000 = 0.300
Split D by Social Butterfly (Yes / No):
| Subset | n | Ma | Hy | Re |
|---|---|---|---|---|
| Yes | 4 | 0 | 3 | 1 |
| No | 6 | 3 | 0 | 3 |
Gini(DYes) = 1 − (3/4)² − (1/4)²
= 1 − 0.5625 − 0.0625 = 0.375
Gini(DNo) = 1 − (3/6)² − (3/6)²
= 1 − 0.250 − 0.250 = 0.500
Weighted Gini(Social) = (4/10)·0.375 + (6/10)·0.500
= 0.150 + 0.300 = 0.450
Split D by Strength Level (High / Medium / Low):
| Subset | n | Ma | Hy | Re |
|---|---|---|---|---|
| High | 2 | 1 | 1 | 0 |
| Medium | 3 | 0 | 2 | 1 |
| Low | 5 | 2 | 0 | 3 |
Gini(DHigh) = 1 − (1/2)² − (1/2)²
= 1 − 0.250 − 0.250 = 0.500
Gini(DMedium) = 1 − (2/3)² − (1/3)²
= 1 − 0.444 − 0.111 = 0.444
Gini(DLow) = 1 − (2/5)² − (3/5)²
= 1 − 0.160 − 0.360 = 0.480
Weighted Gini(Strength) = (2/10)·0.500 + (3/10)·0.444 + (5/10)·0.480
= 0.100 + 0.133 + 0.240 = 0.473
| Attribute | Weighted Gini impurity | G (Info Gain) |
|---|---|---|
| Enjoys Suffering | 0.300 → lowest → root | 0.971 → highest |
| Social Butterfly | 0.450 | 0.647 |
| Strength Level | 0.473 | 0.610 |
Enjoys Suffering has the lowest weighted Gini impurity (0.300) → it is the root node. This matches the Information Gain result from Exercise 2.
Gini and Information Gain agree here — both rank the attributes in the same order. In practice they usually agree, but can differ when one attribute has many values or when class distributions are skewed. Gini is computationally cheaper — no logarithms required.
For numerical attributes, rather than treating each value as a separate category, we find the optimal binary split point:
The same numerical attribute can be used multiple times in a decision tree with different thresholds.
Replacing Snow nearby with Snow Depth (cm) as a numerical attribute:
Sorted values: 0, 0, 1, 2, 3, 5, 8, 15, 25, 30, 35
Testing threshold: Snow Depth ≤ 10 cm
H(D≤10) = 0.863
H(D>10) = 0 (pure)
G(D, Depth≤10) = 0.994 − 7/11·0.863 − 4/11·0 = 0.445
Same information gain (0.445) as Snow nearby — equally good root node choice.
We add another competition "CrossFit" which results in mixed leaf nodes for Enjoys Suffering = Yes and Social Butterfly = Yes.
Two numerical attributes are now added: weekly training volume (km) and running pace (min/km). Your task: find the best threshold for each and build the full tree.
| # | Suffering | Social | Weekly km | Pace (min/km) | Event |
|---|---|---|---|---|---|
| 1 | Yes | No | 65 | 4:30 | Marathon |
| 2 | Yes | Yes | 42 | 5:30 | Hyrox |
| 3 | No | No | 0 | 7:00 | Rest |
| 4 | Yes | No | 70 | 4:18 | Marathon |
| 5 | No | Yes | 20 | 6:45 | Rest |
| 6 | No | No | 10 | 7:15 | Rest |
| 7 | No | No | 15 | 7:30 | Rest |
| 8 | Yes | Yes | 38 | 5:45 | Hyrox |
| 9 | Yes | Yes | 5 | 6:12 | CrossFit |
| 10 | Yes | No | 60 | 4:45 | Marathon |
| 11 | Yes | Yes | 22 | 6:30 | CrossFit |
| 12 | Yes | Yes | 45 | 5:18 | Hyrox |
The categorical splits are already solved. All relevant values are provided — your task is only to find the best numeric threshold.
The Social=Yes branch has 5 athletes. Sort both numerical attributes:
5 · 22 · 38 · 42 · 45
Thresholds to test (midpoints between adjacent values):
14 · 30 · 40 · 44
For numerical attributes we test all midpoints between adjacent sorted values — the threshold with the highest information gain is selected.
Because 30 sits exactly between the CrossFit athletes (km = 5 and 22) and the Hyrox athletes (km = 38, 42, 45). Everything ≤ 30 is CrossFit, everything > 30 is Hyrox — a perfect natural gap in the data. You wouldn't know this upfront. You'd test all four thresholds (14, 30, 40, 44), calculate G for each, and 30 would come out highest — which is what Step 2 shows.
Split the Social=Yes branch by Weekly km ≤ 30:
| Subset | Athletes | n | Hy | CF |
|---|---|---|---|---|
| km ≤ 30 | #9 (5 km), #11 (22 km) | 2 | 0 | 2 |
| km > 30 | #8 (38 km), #2 (42 km), #12 (45 km) | 3 | 3 | 0 |
H(Dkm≤30) = −(2/2·log₂(2/2)) = −(1·0) = 0.000 (pure — all CrossFit)
H(Dkm>30) = −(3/3·log₂(3/3)) = −(1·0) = 0.000 (pure — all Hyrox)
G(Dsoc=yes, km≤30) = 0.971 − (2/5)·0.000 − (3/5)·0.000
= 0.971 − 0 − 0 = 0.971 — perfect split
Split the Social=Yes branch by Pace ≤ 5:57:
| Subset | Athletes | n | Hy | CF |
|---|---|---|---|---|
| pace ≤ 5:57 | #12 (5:18), #2 (5:30), #8 (5:45) | 3 | 3 | 0 |
| pace > 5:57 | #9 (6:12), #11 (6:30) | 2 | 0 | 2 |
H(Dpace≤5:57) = −(3/3·log₂(3/3)) = 0.000 (pure — all Hyrox)
H(Dpace>5:57) = −(2/2·log₂(2/2)) = 0.000 (pure — all CrossFit)
G(Dsoc=yes, pace≤5:57) = 0.971 − (3/5)·0.000 − (2/5)·0.000
= 0.971 − 0 − 0 = 0.971 — equally perfect split
| Threshold | G | Both leaves pure? |
|---|---|---|
| Weekly km ≤ 30 | 0.971 | Yes — CrossFit (2/2) · Hyrox (3/3) |
| Pace ≤ 5:57 | 0.971 | Yes — Hyrox (3/3) · CrossFit (2/2) |
Both thresholds give G = 0.971 — a perfect split. This is not a coincidence: CrossFit athletes in this dataset consistently train fewer km at slower pace, while Hyrox athletes train more km at faster pace. The two attributes are perfectly correlated in this subset. Either split is valid — use Weekly km ≤ 30 for the tree.
The CrossFit leaf is based on only 2 examples. This makes it a prime candidate for pruning — we will revisit this in Exercise 5.
Our tree perfectly classifies all 12 training examples — but is it too specific?
Post-pruning is more robust — it uses real evidence to decide what to remove.
Replace an internal node with its most frequent class label. If accuracy on the validation set does not decrease — keep the pruned version. Simpler tree, same or better generalisation.
Consider the complete tree from Exercise 4. A validation set of 4 new athletes has arrived:
| # | Suffering | Social | Weekly km | Pace | Event (actual) |
|---|---|---|---|---|---|
| V1 | Yes | Yes | 28 | 6:05 | CrossFit |
| V2 | Yes | Yes | 40 | 5:40 | Hyrox |
| V3 | Yes | No | 55 | 4:55 | Marathon |
| V4 | No | No | 8 | 7:10 | Rest |
Weekly km ≤ 30 node — replace it with the majority class of its training examples (Hyrox×3, CrossFit×2). What is the pruned prediction for V1 and V2?| # | Path through tree | Predicted | Actual | Correct? |
|---|---|---|---|---|
| V1 | Suffering=Yes → Social=Yes → km=28 ≤ 30 | CrossFit | CrossFit | ✓ |
| V2 | Suffering=Yes → Social=Yes → km=40 > 30 | Hyrox | Hyrox | ✓ |
| V3 | Suffering=Yes → Social=No | Marathon | Marathon | ✓ |
| V4 | Suffering=No | Rest | Rest | ✓ |
The original tree correctly classifies 4/4 validation examples — perfect accuracy on the validation set.
We consider pruning the Weekly km ≤ 30 node. The rule for reduced error pruning:
Replace the node with its majority class from the training examples that reach it. Then check: does accuracy on the validation set decrease? If not — keep the pruned version.
km ≤ 30 node? List them.Training examples reaching km ≤ 30: athletes who have Suffering=Yes AND Social=Yes.
That is athletes #2, #8, #9, #11, #12 → Hyrox×3 (km: 38, 42, 45) · CrossFit×2 (km: 5, 22)
Majority class: Hyrox (3 out of 5)
Pruned tree: replace km ≤ 30 node with leaf Hyrox
Pruned tree: Suffering → Social → Hyrox (leaf, no more km split).
For each validation athlete, follow the pruned tree and write down the prediction. Compare to actual.
| # | Suffering | Social | km | Pruned prediction | Actual | Correct? |
|---|---|---|---|---|---|---|
| V1 | Yes | Yes | 28 | Hyrox (leaf — km ignored) | CrossFit | ✗ |
| V2 | Yes | Yes | 40 | Hyrox | Hyrox | ✓ |
| V3 | Yes | No | 55 | Marathon | Marathon | ✓ |
| V4 | No | No | 8 | Rest | Rest | ✓ |
Pruned accuracy: 3/4 (75%) — V1 is now misclassified.
Compare original vs pruned accuracy. Apply the reduced error pruning decision rule.
| Version | V1 | V2 | V3 | V4 | Accuracy |
|---|---|---|---|---|---|
| Original tree | ✓ | ✓ | ✓ | ✓ | 4/4 |
| Pruned tree | ✗ | ✓ | ✓ | ✓ | 3/4 |
Pruned accuracy (3/4) < original accuracy (4/4) → accuracy decreased.
→ By the reduced error pruning rule: keep the original tree.
The validation data confirms the km ≤ 30 split is genuinely useful — even though it was built on only 2 training examples. The CrossFit pattern holds in the validation set too.
Would pre-pruning have prevented the km ≤ 30 node from being created? What minimum examples threshold would you set?
Which examples reach the km ≤ 30 node?
5 athletes (Suffering=Yes, Social=Yes). After the km split:
— km ≤ 30: 2 examples (CrossFit)
— km > 30: 3 examples (Hyrox)
With min_samples = 3: the km ≤ 30 split would be blocked — the CrossFit subset has only 2 examples, below the threshold. The node would become a leaf: Hyrox (majority of 5).
Would that have been correct?
No — post-pruning showed the split was valid. Pre-pruning with min_samples=3 would have removed a genuinely useful node.
Pre-pruning is faster but may stop too early — it cannot use validation evidence. Post-pruning waits for real data before deciding. With small datasets like this one, post-pruning is the safer choice.
In each recursive step of decision tree learning, there are four cases to consider:
Cases 3 and 4 are important for handling edge cases with small or noisy datasets. The algorithm is robust because it always has a way to proceed regardless of the data configuration.
What remains unclear?