Bayesian Networks & Probabilistic Inference

Warm Up — Smart Glasses

Discussion

📰 In the news — 2026

Google I/O, May 19: Project Aura XR smart glasses confirmed for global launch — 90g, 70° display, Gemini AI, three cameras. It "sees and hears what you do" all day.

February 2026: Swedish investigation reveals Meta Ray-Ban glasses sent intimate footage — sex, nudity, bathroom use, bank card numbers — to data annotators in Kenya. Workers described seeing it daily. Their job: label the footage so Meta's AI could learn. Meta sold 7 million pairs in 2025. The LED indicator on the frame told bystanders they were being filmed. It did not tell them a human in Nairobi might be watching. Meta terminated the contract. 1,108 workers lost their jobs.

💬 Discuss with a partner

Would you wear smart glasses? What would it take to trust them — or would you never?
The data pipeline: AI models need human-labeled training data. Footage from your glasses feeds that pipeline. Is that an acceptable trade-off for a better product?
Who is responsible? Meta? The annotators? The users who didn't read the privacy policy? The regulators who didn't act sooner?

    The Next Web (2026). Meta ends Sama contract after Kenyan workers report seeing intimate footage from Ray-Ban smart glasses users. May 5.
    Svenska Dagbladet & Göteborgs-Posten (2026). Joint investigation on Meta Ray-Ban data annotation. Feb 27.
    Wong, R. (2026). Project Aura XR Smart Glasses Are Legit. Gizmodo, May 19.
  

Decision Trees

Discussion

💬 Discussion

What is a decision tree?

Answer

Decision trees are fundamental concepts in knowledge-based AI agents. They offer predictive power and interpretability

represent a function that maps attribute values to a decision
use search to find a decision through a sequence of tests
create a model that is inherintly explainable

How Does a Decision Tree Look Like?

A decision tree is a representation of a function that maps a vector of attribute values to a single output value (a "decision"):

An internal node represents a test of a property
Branches are labeled with possible values of the test
Each leaf node specifies the value to be returned if that leaf is reached

In Boolean decision trees, the input is a vector of input attributes X and a single Boolean output value y. Decision trees can handle both classification (discrete output) and regression (continuous output) tasks.

Example: Restaurant Decision (Russell & Norvig, 2022)

Should I wait for a table?
Input: 10 attributes (Alt, Bar, Fri, Hungry, Patrons, Price, WaitEstimate, Reservation, Raining, Type).
Output: WillWait (true / false)

Figure 1: Decision tree restaurant example based on Russell & Norvig (2022, p. 674)

Exercise 1 — Draw a Decision Tree

Exercise

✏️ Exercise

A bank wants to decide whether to approve a loan application. The following attributes are available:

Attribute	Values
Credit Score	Good / Bad
Income	High / Low
Employment	Employed / Unemployed
Existing Debt	Yes / No

Output: Approve (Yes / No)

Draw a decision tree that classifies loan applications based on these attributes. Use your own intuition — there is no training set yet.
How many leaf nodes does your tree have? How many internal nodes?
Can every possible combination of attribute values be correctly classified by your tree?

Solution: Exercise 1 — Draw a Decision Tree

One intuitive decision tree based on domain knowledge, not training data. Credic Score is the dominant factor in lending, so it makes sense as the root:

          Figure 2: Decision tree loan example.
        
Note that this is just one possible solution, your tree might look very different. A real trained tree might look quite different — it would calculate information gain and discover which attribute actually separates approvals from denials most cleanly in the training data. The structure might surprise you.

Node count: 11

Root node: 1 - Credit score

Internal nodes (decision nodes): 4 - Credit Score, Income (left branch), Employment (right branch), and two Existing debt nodes

Leaf nodes: 6 - four on the bottom row (left branch) + Approve and Deny directly under Employment

Can every possible combination of attribute values be correctly classified by the tree?

Yes — can you shorten this a bit? every possible combination of the 4 binary/categorical attributes leads to exactly one leaf. With 4 attributes that have 2 values each, there are 2⁴ = 16 possible input combinations (Credit Score has 2 values, Income has 2, Employment has 2, Existing Debt has 2). The tree as drawn doesn't test all four attributes on every path — for example, a Bad credit score + Unemployed path goes directly to Deny without checking Income or Existing Debt. This is intentional: once we're certain enough, we don't need more tests. This is exactly the principle that makes decision trees efficient, not every path needs to reach maximum depth.

1.3 — Types of Decision Trees

Binary Decision Trees

Each node has exactly two branches (True / False)

For numeric features: "Is X ≤ threshold?" (e.g., "Pages Viewed ≤ 20?")
For categorical features: "Is X = value?" (e.g., "Referrer = Slashdot?")

Maps directly to if-then-else statements in code — computationally efficient

Non-Binary Decision Trees

Categorical nodes can have multiple branches

Each branch corresponds to one possible value
Often visually simpler for features with many values
Requires switch-case logic — less efficient

More intuitive when a feature has many categories (e.g., Country, Type)

💬 Reflect

What did we draw in Exercise 1? Binary or non-binary — and why? Where might we need a non-binary tree instead of a binary one?

What is Learning Again?

Recap

💬 Discussion

The naive approach: construct a tree with one path to a leaf for each example — testing all attributes along the path and attaching the classification to the leaf. What is wrong with this?

Answer

This correctly classifies all training examples — but doesn't generalise
It just memorises the observations
We need a tree that is: (1) consistent with the training set, and (2) as small as possible to promote generalisation

Finding the smallest consistent tree is intractable — but decision tree learning algorithms use greedy heuristics to efficiently find a reasonably small tree.

Divide-and-Conquer Strategy

Decision tree learning adopts a greedy divide-and-conquer approach:

Select the most important attribute to test at the root
Divide the training set into subsets corresponding to each value of that attribute
Recursively apply the same process to each subset

Goal

Reach the correct classification with a small number of tests, creating a tree that is both accurate and shallow.

Key Question

But how do we determine which attribute is most important? → This is where entropy and information gain come in.

Entropy

Entropy comes from information theory and measures the unpredictability of a random variable. When entropy is high, the classes are mixed. When entropy is zero, we have a pure node.

Formal: for a random variable X with possible values V(X):

H(X) = − Σ p(x) · log₂(p(x))

0 ≤ H(X) ≤ log₂(|V(X)|)
H(X) = 0 means complete certainty (pure node)
Maximum value occurs when all outcomes are equally likely

Maximum Entropy — Examples

Binary variable (yes/no): max H = log₂(2) = 1 bit — when P(yes) = P(no) = 0.5
4 possible values (card suit): max H = log₂(4) = 2 bits — each at probability 0.25
8 possible values: max H = log₂(8) = 3 bits

Entropy: Bob's Skiing Example

Example

Bob decides whether to go skiing based on: Snow nearby, Weekend, Sunny.

Snow	Weekend	Sun	Go Skiing
yes	yes	yes	yes
yes	yes	yes	yes
yes	yes	no	yes
yes	no	yes	yes
no	yes	yes	yes
no	yes	yes	yes
no	yes	yes	no
no	yes	no	no
no	no	yes	no
no	no	yes	no
no	no	no	no

Entropy of "Go Skiing"

6 yes, 5 no out of 11 examples:

H(D) = −(6/11 · log₂(6/11) + 5/11 · log₂(5/11)) = 0.994

Close to 1 → high uncertainty. The maximum entropy for a binary variable is 1, indicating we're almost at maximum unpredictability.

Information Gain

Information gain measures how much the entropy decreases when we split the data based on a particular attribute.

Formal: for a dataset D and attribute X:

G(D, X) = H(D) − Σ (|D_x| / |D|) · H(D_x)

H(D) — entropy of the original dataset
D_x — subset of D where attribute X has value x
|D_x| — size of subset D_x
H(D_x) — entropy of subset D_x

Decision Rule

The attribute with the highest information gain is the most important attribute — select it as the split node.

Information Gain: Bob's Skiing Example

Example

Bob decides whether to go skiing based on: Snow nearby, Weekend, Sunny.

Snow	Weekend	Sun	Go Skiing
yes	yes	yes	yes
yes	yes	yes	yes
yes	yes	no	yes
yes	no	yes	yes
no	yes	yes	yes
no	yes	yes	yes
no	yes	yes	no
no	yes	no	no
no	no	yes	no
no	no	yes	no
no	no	no	no

Information Gain of "Go Skiing"

Calculating information gain for the Snow nearby attribute:

D_yes: 4 examples, all Go Skiing = yes → entropy = 0 (pure node)
D_no: 7 examples, 2 yes and 5 no → entropy = 0.863

G(D, Snow) = 0.994 − (4/11 · 0) − (7/11 · 0.863) = 0.445

Similarly:

G(D, Weekend) = 0.150
G(D, Sun) = 0.049

Result

Since Snow nearby has the highest information gain (0.445), we select it as the root node.

How to Build a Decision Tree

After selecting Snow nearby as root, continue the process recursively for each subset:

D_yes — all examples have Go Skiing = yes → pure node, we're done
D_no — mixed yes and no → calculate information gain for remaining attributes on this subset:
- Weekend has higher gain (0.292) than Sun (0.169) → split on Weekend
- Continue recursively…

Decision tree for Go Skiing? — Snow nearby as root, branching through Weekend and Sun

Figure 3: Decision tree Example.

Exercise 2 — Which Race Should We Sign Up For?

Dataset

Exercise

You are training hard in your fitness class each week. A member of your team proposes to attend a race. The group can't agree. So you do what any data-driven athlete would do: collect data from 10 athletes who already chose — and build a decision tree to let the model decide. Calculate (1) entropy of the dataset, (2) information gain of each attribute, (3) identify the root node.

#	Strength Level	Ejoys Suffering	Social Butterfly	Event
1	Low	Yes	No	Marathon
2	Medium	Yes	Yes	Hyrox
3	Low	No	No	Rest
4	High	Yes	No	Marathon
5	Medium	No	Yes	Rest
6	Low	No	No	Rest
7	Low	No	No	Rest
8	Medium	Yes	Yes	Hyrox
9	High	Yes	Yes	Hyrox
10	Low	Yes	No	Marathon

Step 1 — Calculate H(D)

Exercise

Recall

H(D) = − Σ p(x) · log₂(p(x))

✏️ Task

Count the examples per class. What is the probability of each?
Calculate H(D).
Is the entropy high or low relative to the maximum? What does it tell us?

Solution

Marathon: 3/10 · Hyrox: 3/10 · Rest: 4/10

H(D) = −(3/10·log₂(3/10) + 3/10·log₂(3/10) + 4/10·log₂(4/10))
= −(3/10·(−1.737) + 3/10·(−1.737) + 4/10·(−1.322))
= −(−0.521 − 0.521 − 0.529) = 1.571

Maximum for 3 classes = log₂(3) = 1.585 — we are very close to maximum uncertainty. The dataset is highly mixed.

Step 2 — Calculate Information Gain

Exercise

Recall

H(D) = − Σ p(x) · log₂(p(x))

G(D, X) = H(D) − Σ (|D_x| / |D|) · H(D_x)

✏️ Task — repeat for each of the 3 attributes

Split D by the attribute — how many Ma / Hy / Re fall into each subset?
For each subset
2.1 calculate entropy
2.2 calculate information gain G(D,X)
Note the result — compare all three G values at the end to identify the root node

Step 2 — G(D, Enjoys Suffering)

Exercise

Split D by Enjoys Suffering (Yes / No):

Subset	#	Ma	Hy	Re
Yes	6	3	3	0
No	4	0	0	4

Solution

H(D_Yes) = −(3/6·log₂(3/6) + 3/6·log₂(3/6))
= −(3/6·(−1.000) + 3/6·(−1.000))
= −(−0.500 − 0.500) = 1.000

H(D_No) = −(4/4·log₂(4/4)) = −(1·0) = 0.000 (pure — all Rest)

G(D, Suffering) = 1.571 − (6/10)·1.000 − (4/10)·0.000
= 1.571 − 0.600 − 0.000 = 0.971

Step 2 — G(D, Strength Level)

Exercise

Split D by Strength Level (High / Medium / Low):

Subset	#	Ma	Hy	Re
High	2	1	1	0
Medium	3	0	2	1
Low	5	2	0	3

Solution

H(D_High) = −(1/2·log₂(1/2) + 1/2·log₂(1/2))
= −(1/2·(−1.000) + 1/2·(−1.000))
= −(−0.500 − 0.500) = 1.000

H(D_Medium) = −(2/3·log₂(2/3) + 1/3·log₂(1/3))
= −(2/3·(−0.585) + 1/3·(−1.585))
= −(−0.390 − 0.528) = 0.918

H(D_Low) = −(2/5·log₂(2/5) + 3/5·log₂(3/5))
= −(2/5·(−1.322) + 3/5·(−0.737))
= −(−0.529 − 0.442) = 0.971

G(D, Strength) = 1.571 − (2/10)·1.000 − (3/10)·0.918 − (5/10)·0.971
= 1.571 − 0.200 − 0.275 − 0.486 = 0.610

Step 2 — G(D, Social Butterfly)

Exercise

Split D by Social Butterfly (Yes / No):

Subset	#	Ma	Hy	Re
Yes	4	0	3	1
No	6	3	0	3

Solution

H(D_Yes) = −(3/4·log₂(3/4) + 1/4·log₂(1/4))
= −(3/4·(−0.415) + 1/4·(−2.000))
= −(−0.311 − 0.500) = 0.811

H(D_No) = −(3/6·log₂(3/6) + 3/6·log₂(3/6))
= −(3/6·(−1.000) + 3/6·(−1.000))
= −(−0.500 − 0.500) = 1.000

G(D, Social) = 1.571 − (4/10)·0.811 − (6/10)·1.000
= 1.571 − 0.324 − 0.600 = 0.647

Step 2 — Which Attribute is the Root?

Solution

Attribute	G(D, X)
Enjoys Suffering	0.971 → highest → root
Social Butterfly	0.647
Strength Level	0.610

Result

Enjoys Suffering has the highest information gain (0.971) → it is the root node.
Splitting immediately produces one pure leaf: Suffering = No → all 4 athletes chose Rest.

Step 3 — Recurse: D_{Suffering=Yes}

Additional Exercise

D_Yes has 6 examples: Marathon×3, Hyrox×3, Rest×0. H(D_Yes) = 1.000. Calculate G for the remaining 2 attributes.

✏️ Task

Split D_Yes by Social Butterfly — count Ma/Hy per subset, calculate H, calculate G
Split D_Yes by Strength Level — same process
Which attribute should split D_Yes?

Solution

G(D_Yes, Social):

Social=Yes (3): Ma=0, Hy=3 → H = 0.000 (pure)
Social=No (3): Ma=3, Hy=0 → H = 0.000 (pure)

= 1.000 − (3/6)·0.000 − (3/6)·0.000
= 1.000 − 0 − 0 = 1.000

G(D_Yes, Strength):

High (2): Ma=1, Hy=1 → H = 1.000
Medium (2): Ma=0, Hy=2 → H = 0.000 (pure)
Low (2): Ma=2, Hy=0 → H = 0.000 (pure)

= 1.000 − (2/6)·1.000 − 0 − 0
= 1.000 − 0.333 = 0.667

Result

Social Butterfly has the highest G (1.000) → perfect split.
Social=Yes → pure: Hyrox (3/3) · Social=No → pure: Marathon (3/3)

Solution — The Complete Tree

Solution

Key takeaways

Suffering is root — highest G (0.971), immediately isolates all Rest cases
Social perfectly separates Marathon (No) from Hyrox (Yes) — G = 1.000, maximum possible
Strength was never needed — the tree is complete without it
All 3 leaf nodes are pure · 2 internal nodes · depth 2

Gini Impurity

Instead of information gain, Gini impurity is another common splitting criterion — it measures the probability of incorrectly classifying a randomly chosen element if it were randomly labeled according to the class distribution:

Gini(X) = Σ p(x) · (1 − p(x))

Properties

0 = all elements belong to the same class (pure)
Maximum when classes are equally likely
Lower value = "purer" node
Often used in CART (Classification and Regression Trees) algorithm

Entropy vs. Gini

Entropy measures uncertainty
Gini measures expected error rate
In practice both often yield similar trees
Gini is computationally more efficient — no logarithms

Exercise 3 — Decision Tree with Gini Impurity

Exercise

Recall — Gini Impurity

Gini(D) = 1 − Σ p(x)²

Measures the probability of incorrectly classifying a randomly chosen element. A pure node has Gini = 0. Instead of information gain, we now look for the attribute with the lowest weighted Gini impurity after splitting.

✏️ Task — repeat for each of the 3 attributes

Split D by the attribute — how many Ma / Hy / Re fall into each subset?
For each subset, calculate Gini(D_x) = 1 − Σ p(x)²
Calculate the weighted Gini impurity: Σ (|D_x| / |D|) · Gini(D_x)
Note the result — compare all three values at the end to identify the root node

Step 1 — Calculate Gini(D)

Exercise

Recall

Gini(D) = 1 − Σ p(x)²

✏️ Task

Count the examples per class. What is the probability of each?
Calculate Gini(D).
Is the Gini high or low relative to the maximum? What does it tell us?

Solution

Marathon: 3/10 · Hyrox: 3/10 · Rest: 4/10

Gini(D) = 1 − (3/10)² − (3/10)² − (4/10)²
= 1 − 0.090 − 0.090 − 0.160 = 0.660

Maximum for 3 equal classes = 1 − 3·(1/3)² = 0.667 — we are very close to maximum impurity. The dataset is highly mixed.

Step 2 — Weighted Gini for Enjoys Suffering

Exercise

Split D by Enjoys Suffering (Yes / No):

Subset	n	Ma	Hy	Re
Yes	6	3	3	0
No	4	0	0	4

✏️ Task

Calculate Gini(D_Yes) and Gini(D_No)
Calculate the weighted Gini = (6/10)·Gini(D_Yes) + (4/10)·Gini(D_No)

Solution

Gini(D_Yes) = 1 − (3/6)² − (3/6)²
= 1 − 0.250 − 0.250 = 0.500

Gini(D_No) = 1 − (4/4)²
= 1 − 1.000 = 0.000 (pure — all Rest)

Weighted Gini(Suffering) = (6/10)·0.500 + (4/10)·0.000
= 0.300 + 0.000 = 0.300

Step 2 — Weighted Gini for Social Butterfly

Exercise

Split D by Social Butterfly (Yes / No):

Subset	n	Ma	Hy	Re
Yes	4	0	3	1
No	6	3	0	3

Solution

Gini(D_Yes) = 1 − (3/4)² − (1/4)²
= 1 − 0.5625 − 0.0625 = 0.375

Gini(D_No) = 1 − (3/6)² − (3/6)²
= 1 − 0.250 − 0.250 = 0.500

Weighted Gini(Social) = (4/10)·0.375 + (6/10)·0.500
= 0.150 + 0.300 = 0.450

Step 2 — Weighted Gini for Strength Level

Exercise

Split D by Strength Level (High / Medium / Low):

Subset	n	Ma	Hy	Re
High	2	1	1	0
Medium	3	0	2	1
Low	5	2	0	3

Solution

Gini(D_High) = 1 − (1/2)² − (1/2)²
= 1 − 0.250 − 0.250 = 0.500

Gini(D_Medium) = 1 − (2/3)² − (1/3)²
= 1 − 0.444 − 0.111 = 0.444

Gini(D_Low) = 1 − (2/5)² − (3/5)²
= 1 − 0.160 − 0.360 = 0.480

Weighted Gini(Strength) = (2/10)·0.500 + (3/10)·0.444 + (5/10)·0.480
= 0.100 + 0.133 + 0.240 = 0.473

Step 3 — Which Attribute is the Root?

Solution

Attribute	Weighted Gini impurity	G (Info Gain)
Enjoys Suffering	0.300 → lowest → root	0.971 → highest
Social Butterfly	0.450	0.647
Strength Level	0.473	0.610

Result

Enjoys Suffering has the lowest weighted Gini impurity (0.300) → it is the root node. This matches the Information Gain result from Exercise 2.

Reflect

Gini and Information Gain agree here — both rank the attributes in the same order. In practice they usually agree, but can differ when one attribute has many values or when class distributions are skewed. Gini is computationally cheaper — no logarithms required.

Numerical Attributes

For numerical attributes, rather than treating each value as a separate category, we find the optimal binary split point:

Sort the values of the attribute in the training set
Consider all possible thresholds between adjacent values
Calculate information gain (or Gini) for each threshold
Select the threshold with the highest information gain

Key Insight

The same numerical attribute can be used multiple times in a decision tree with different thresholds.

Numerical Attributes: Snow Depth Example

Example

Replacing Snow nearby with Snow Depth (cm) as a numerical attribute:

Sorted values: 0, 0, 1, 2, 3, 5, 8, 15, 25, 30, 35

Testing threshold: Snow Depth ≤ 10 cm

≤ 10 cm: 7 examples (2 yes, 5 no)
> 10 cm: 4 examples (4 yes, 0 no)

H(D≤10) = 0.863
H(D>10) = 0 (pure)
G(D, Depth≤10) = 0.994 − 7/11·0.863 − 4/11·0 = 0.445

Decision tree: Snow Depth ≤ 10cm root, No → Go Skiing Yes, Yes → Weekend → Sun

Result

Same information gain (0.445) as Snow nearby — equally good root node choice.

Exercise 4 — Numeric Attributes: Which Race?

Dataset

Extension

We add another competition "CrossFit" which results in mixed leaf nodes for Enjoys Suffering = Yes and Social Butterfly = Yes.
Two numerical attributes are now added: weekly training volume (km) and running pace (min/km). Your task: find the best threshold for each and build the full tree.

#	Suffering	Social	Weekly km	Pace (min/km)	Event
1	Yes	No	65	4:30	Marathon
2	Yes	Yes	42	5:30	Hyrox
3	No	No	0	7:00	Rest
4	Yes	No	70	4:18	Marathon
5	No	Yes	20	6:45	Rest
6	No	No	10	7:15	Rest
7	No	No	15	7:30	Rest
8	Yes	Yes	38	5:45	Hyrox
9	Yes	Yes	5	6:12	CrossFit
10	Yes	No	60	4:45	Marathon
11	Yes	Yes	22	6:30	CrossFit
12	Yes	Yes	45	5:18	Hyrox

Exercise 4 — What We Already Know

Recap

The categorical splits are already solved. All relevant values are provided — your task is only to find the best numeric threshold.

✏️ Task

Sort the Weekly km values. What thresholds would you test?
Test km ≤ threshold — calculate H for each subset, then G
Test pace ≤ 5:57 — same process
Which threshold gives the highest G? Are both leaves pure?

Exercise 4 — Step 1: Sort Values and Identify Thresholds

Solution

The Social=Yes branch has 5 athletes. Sort both numerical attributes:

Weekly km — sorted

5 · 22 · 38 · 42 · 45

Thresholds to test (midpoints between adjacent values):

14 · 30 · 40 · 44

Recall

For numerical attributes we test all midpoints between adjacent sorted values — the threshold with the highest information gain is selected.

Because 30 sits exactly between the CrossFit athletes (km = 5 and 22) and the Hyrox athletes (km = 38, 42, 45). Everything ≤ 30 is CrossFit, everything > 30 is Hyrox — a perfect natural gap in the data. You wouldn't know this upfront. You'd test all four thresholds (14, 30, 40, 44), calculate G for each, and 30 would come out highest — which is what Step 2 shows.

Exercise 4 — Step 2: G(D_soc=yes, km ≤ 30)

Solution

Split the Social=Yes branch by Weekly km ≤ 30:

Subset	Athletes	n	Hy	CF
km ≤ 30	#9 (5 km), #11 (22 km)	2	0	2
km > 30	#8 (38 km), #2 (42 km), #12 (45 km)	3	3	0

Calculation

H(D_km≤30) = −(2/2·log₂(2/2)) = −(1·0) = 0.000 (pure — all CrossFit)

H(D_km>30) = −(3/3·log₂(3/3)) = −(1·0) = 0.000 (pure — all Hyrox)

G(D_soc=yes, km≤30) = 0.971 − (2/5)·0.000 − (3/5)·0.000
= 0.971 − 0 − 0 = 0.971 — perfect split

Exercise 4 — Step 3: G(D_soc=yes, pace ≤ 5:57)

Solution

Split the Social=Yes branch by Pace ≤ 5:57:

Subset	Athletes	n	Hy	CF
pace ≤ 5:57	#12 (5:18), #2 (5:30), #8 (5:45)	3	3	0
pace > 5:57	#9 (6:12), #11 (6:30)	2	0	2

Calculation

H(D_pace≤5:57) = −(3/3·log₂(3/3)) = 0.000 (pure — all Hyrox)

H(D_pace>5:57) = −(2/2·log₂(2/2)) = 0.000 (pure — all CrossFit)

G(D_soc=yes, pace≤5:57) = 0.971 − (3/5)·0.000 − (2/5)·0.000
= 0.971 − 0 − 0 = 0.971 — equally perfect split

Exercise 4 — Step 4: Which Threshold to Choose?

Solution

Threshold	G	Both leaves pure?
Weekly km ≤ 30	0.971	Yes — CrossFit (2/2) · Hyrox (3/3)
Pace ≤ 5:57	0.971	Yes — Hyrox (3/3) · CrossFit (2/2)

Result

Both thresholds give G = 0.971 — a perfect split. This is not a coincidence: CrossFit athletes in this dataset consistently train fewer km at slower pace, while Hyrox athletes train more km at faster pace. The two attributes are perfectly correlated in this subset. Either split is valid — use Weekly km ≤ 30 for the tree.

Pruning note

The CrossFit leaf is based on only 2 examples. This makes it a prime candidate for pruning — we will revisit this in Exercise 5.

Exercise 4 — The Complete Tree

Solution

Key takeaways

Suffering — root, isolates Rest immediately
Social=No → Marathon pure — marathoners train alone
Weekly km ≤ 30 separates CrossFit (low volume) from Hyrox (higher volume)
Pace ≤ 5:57 is an equally valid alternative split — same result, different attribute

3.2 — Pruning: Recap

Our tree perfectly classifies all 12 training examples — but is it too specific?

The Problem: Overfitting

A deeper tree fits training data more perfectly
But it may learn noise, not patterns
Result: poor generalisation to new athletes
Our CrossFit leaf is based on only 2 examples — how confident are we?

The Solution: Pruning

Pre-pruning: stop growing early — set minimum examples per node, maximum depth, or minimum IG threshold
Post-pruning: build the full tree, then remove nodes that do not improve performance on a validation set

Post-pruning is more robust — it uses real evidence to decide what to remove.

Reduced Error Pruning

Replace an internal node with its most frequent class label. If accuracy on the validation set does not decrease — keep the pruned version. Simpler tree, same or better generalisation.

Exercise 5 — Pruning

Exercise

Consider the complete tree from Exercise 4. A validation set of 4 new athletes has arrived:

#	Suffering	Social	Weekly km	Pace	Event (actual)
V1	Yes	Yes	28	6:05	CrossFit
V2	Yes	Yes	40	5:40	Hyrox
V3	Yes	No	55	4:55	Marathon
V4	No	No	8	7:10	Rest

✏️ Task

Run all 4 validation athletes through the tree. What does the tree predict for each? How many are correctly classified?
Apply reduced error pruning to the Weekly km ≤ 30 node — replace it with the majority class of its training examples (Hyrox×3, CrossFit×2). What is the pruned prediction for V1 and V2?
Does pruning increase or decrease accuracy on the validation set? Should we keep the pruned version?
Would pre-pruning have prevented this node from being created? What minimum examples threshold would you set?

Exercise 5 — Step 1: Run Validation Through the Tree

Solution

#	Path through tree	Predicted	Actual	Correct?
V1	Suffering=Yes → Social=Yes → km=28 ≤ 30	CrossFit	CrossFit	✓
V2	Suffering=Yes → Social=Yes → km=40 > 30	Hyrox	Hyrox	✓
V3	Suffering=Yes → Social=No	Marathon	Marathon	✓
V4	Suffering=No	Rest	Rest	✓

Result

The original tree correctly classifies 4/4 validation examples — perfect accuracy on the validation set.

Exercise 5 — Step 2: Apply Reduced Error Pruning

Solution

We consider pruning the Weekly km ≤ 30 node. The rule for reduced error pruning:

Recall — Reduced Error Pruning

Replace the node with its majority class from the training examples that reach it. Then check: does accuracy on the validation set decrease? If not — keep the pruned version.

🔎 Hint

Which training examples reach the km ≤ 30 node? List them.
Count the classes — what is the majority class?
Replace the node with that class label — draw the pruned tree.

Solution

Training examples reaching km ≤ 30: athletes who have Suffering=Yes AND Social=Yes.
That is athletes #2, #8, #9, #11, #12 → Hyrox×3 (km: 38, 42, 45) · CrossFit×2 (km: 5, 22)

Majority class: Hyrox (3 out of 5)

Pruned tree: replace km ≤ 30 node with leaf Hyrox

Exercise 5 — Step 3: Run Validation Through the Pruned Tree

Solution

Pruned tree: Suffering → Social → Hyrox (leaf, no more km split).

🔎 Hint

For each validation athlete, follow the pruned tree and write down the prediction. Compare to actual.

#	Suffering	Social	km	Pruned prediction	Actual	Correct?
V1	Yes	Yes	28	Hyrox (leaf — km ignored)	CrossFit	✗
V2	Yes	Yes	40	Hyrox	Hyrox	✓
V3	Yes	No	55	Marathon	Marathon	✓
V4	No	No	8	Rest	Rest	✓

Result

Pruned accuracy: 3/4 (75%) — V1 is now misclassified.

Exercise 5 — Step 4: Keep or Discard the Pruned Version?

Solution

🔎 Hint

Compare original vs pruned accuracy. Apply the reduced error pruning decision rule.

Version	V1	V2	V3	V4	Accuracy
Original tree	✓	✓	✓	✓	4/4
Pruned tree	✗	✓	✓	✓	3/4

Decision

Pruned accuracy (3/4) < original accuracy (4/4) → accuracy decreased.

→ By the reduced error pruning rule: keep the original tree.

The validation data confirms the km ≤ 30 split is genuinely useful — even though it was built on only 2 training examples. The CrossFit pattern holds in the validation set too.

Exercise 5 — Step 5: Pre-Pruning

Solution

✏️ Task

Would pre-pruning have prevented the km ≤ 30 node from being created? What minimum examples threshold would you set?

Solution

Which examples reach the km ≤ 30 node?
5 athletes (Suffering=Yes, Social=Yes). After the km split:
— km ≤ 30: 2 examples (CrossFit)
— km > 30: 3 examples (Hyrox)

With min_samples = 3: the km ≤ 30 split would be blocked — the CrossFit subset has only 2 examples, below the threshold. The node would become a leaf: Hyrox (majority of 5).

Would that have been correct?
No — post-pruning showed the split was valid. Pre-pruning with min_samples=3 would have removed a genuinely useful node.

Key Insight

Pre-pruning is faster but may stop too early — it cannot use validation evidence. Post-pruning waits for real data before deciding. With small datasets like this one, post-pruning is the safer choice.

Recursive Learning Process

In each recursive step of decision tree learning, there are four cases to consider:

Mixed examples (positive and negative): Choose the best attribute and split
Pure node (all examples have same class): Create a leaf node with that class
No examples: Create a leaf with the majority class from the parent node
No attributes left but mixed classes: Create a leaf with the majority class (handles noisy data)

Note

Cases 3 and 4 are important for handling edge cases with small or noisy datasets. The algorithm is robust because it always has a way to proceed regardless of the data configuration.

Decision Trees

Warm Up — Smart Glasses

General Introduction to Decision Trees

Decision Trees

How Does a Decision Tree Look Like?

Exercise 1 — Draw a Decision Tree

Solution: Exercise 1 — Draw a Decision Tree

1.3 — Types of Decision Trees

Learning Decision Trees

What is Learning Again?

Divide-and-Conquer Strategy

Entropy

Entropy: Bob's Skiing Example

Information Gain

Information Gain: Bob's Skiing Example

How to Build a Decision Tree

Exercise 2 — Which Race Should We Sign Up For?

Step 1 — Calculate H(D)

Step 2 — Calculate Information Gain

Step 2 — G(D, Enjoys Suffering)

Step 2 — G(D, Strength Level)

Step 2 — G(D, Social Butterfly)

Step 2 — Which Attribute is the Root?

Step 3 — Recurse: DSuffering=Yes

Solution — The Complete Tree

Gini Impurity

Exercise 3 — Decision Tree with Gini Impurity

Step 1 — Calculate Gini(D)

Step 2 — Weighted Gini for Enjoys Suffering

Step 2 — Weighted Gini for Social Butterfly

Step 2 — Weighted Gini for Strength Level

Step 3 — Which Attribute is the Root?

Numerical Attributes

Numerical Attributes: Snow Depth Example

Exercise 4 — Numeric Attributes: Which Race?

Exercise 4 — What We Already Know

Exercise 4 — Step 1: Sort Values and Identify Thresholds

Exercise 4 — Step 2: G(Dsoc=yes, km ≤ 30)

Exercise 4 — Step 3: G(Dsoc=yes, pace ≤ 5:57)

Exercise 4 — Step 4: Which Threshold to Choose?

Exercise 4 — The Complete Tree

3.2 — Pruning: Recap

Exercise 5 — Pruning

Exercise 5 — Step 1: Run Validation Through the Tree

Exercise 5 — Step 2: Apply Reduced Error Pruning

Exercise 5 — Step 3: Run Validation Through the Pruned Tree

Exercise 5 — Step 4: Keep or Discard the Pruned Version?

Exercise 5 — Step 5: Pre-Pruning

Recursive Learning Process

Questions?

Step 3 — Recurse: D_{Suffering=Yes}

Exercise 4 — Step 2: G(D_soc=yes, km ≤ 30)

Exercise 4 — Step 3: G(D_soc=yes, pace ≤ 5:57)