V 1.0

Decision Trees

Introduction to AI (I2AI)

Deinera Jechle    Neu-Ulm University of Applied Sciences
21. May 2026

Warm Up — Smart Glasses

Discussion
📰 In the news — 2026
Google I/O, May 19: Project Aura XR smart glasses confirmed for global launch — 90g, 70° display, Gemini AI, three cameras. It "sees and hears what you do" all day.

February 2026: Swedish investigation reveals Meta Ray-Ban glasses sent intimate footage — sex, nudity, bathroom use, bank card numbers — to data annotators in Kenya. Workers described seeing it daily. Their job: label the footage so Meta's AI could learn. Meta sold 7 million pairs in 2025. The LED indicator on the frame told bystanders they were being filmed. It did not tell them a human in Nairobi might be watching. Meta terminated the contract. 1,108 workers lost their jobs.
💬 Discuss with a partner
  1. Would you wear smart glasses? What would it take to trust them — or would you never?
  2. The data pipeline: AI models need human-labeled training data. Footage from your glasses feeds that pipeline. Is that an acceptable trade-off for a better product?
  3. Who is responsible? Meta? The annotators? The users who didn't read the privacy policy? The regulators who didn't act sooner?
Meta Ray-Ban AI smart glasses
The Next Web (2026). Meta ends Sama contract after Kenyan workers report seeing intimate footage from Ray-Ban smart glasses users. May 5. Svenska Dagbladet & Göteborgs-Posten (2026). Joint investigation on Meta Ray-Ban data annotation. Feb 27. Wong, R. (2026). Project Aura XR Smart Glasses Are Legit. Gizmodo, May 19.

General Introduction to Decision Trees

Definition, structure, advantages — and why decision trees remain a critical part of the AI toolkit despite decades of newer approaches.

01

Decision Trees

Discussion
💬 Discussion

What is a decision tree?

Answer

Decision trees are fundamental concepts in knowledge-based AI agents. They offer predictive power and interpretability

  • represent a function that maps attribute values to a decision
  • use search to find a decision through a sequence of tests
  • create a model that is inherintly explainable

How Does a Decision Tree Look Like?

A decision tree is a representation of a function that maps a vector of attribute values to a single output value (a "decision"):

  • An internal node represents a test of a property
  • Branches are labeled with possible values of the test
  • Each leaf node specifies the value to be returned if that leaf is reached

In Boolean decision trees, the input is a vector of input attributes X and a single Boolean output value y. Decision trees can handle both classification (discrete output) and regression (continuous output) tasks.

Example: Restaurant Decision (Russell & Norvig, 2022)

Should I wait for a table?
Input: 10 attributes (Alt, Bar, Fri, Hungry, Patrons, Price, WaitEstimate, Reservation, Raining, Type).
Output: WillWait (true / false)

Example decision tree

Figure 1: Decision tree restaurant example based on Russell & Norvig (2022, p. 674)

Exercise 1 — Draw a Decision Tree

Exercise
✏️ Exercise

A bank wants to decide whether to approve a loan application. The following attributes are available:

AttributeValues
Credit ScoreGood / Bad
IncomeHigh / Low
EmploymentEmployed / Unemployed
Existing DebtYes / No

Output: Approve (Yes / No)

  1. Draw a decision tree that classifies loan applications based on these attributes. Use your own intuition — there is no training set yet.
  2. How many leaf nodes does your tree have? How many internal nodes?
  3. Can every possible combination of attribute values be correctly classified by your tree?

Solution: Exercise 1 — Draw a Decision Tree

One intuitive decision tree based on domain knowledge, not training data. Credic Score is the dominant factor in lending, so it makes sense as the root:

Example decision tree

Figure 2: Decision tree loan example.

Note that this is just one possible solution, your tree might look very different. A real trained tree might look quite different — it would calculate information gain and discover which attribute actually separates approvals from denials most cleanly in the training data. The structure might surprise you.

Node count: 11

  • Root node: 1 - Credit score
  • Internal nodes (decision nodes): 4 - Credit Score, Income (left branch), Employment (right branch), and two Existing debt nodes
  • Leaf nodes: 6 - four on the bottom row (left branch) + Approve and Deny directly under Employment
  • Can every possible combination of attribute values be correctly classified by the tree?

    Yes — can you shorten this a bit? every possible combination of the 4 binary/categorical attributes leads to exactly one leaf. With 4 attributes that have 2 values each, there are 2⁴ = 16 possible input combinations (Credit Score has 2 values, Income has 2, Employment has 2, Existing Debt has 2). The tree as drawn doesn't test all four attributes on every path — for example, a Bad credit score + Unemployed path goes directly to Deny without checking Income or Existing Debt. This is intentional: once we're certain enough, we don't need more tests. This is exactly the principle that makes decision trees efficient, not every path needs to reach maximum depth.

    1.3 — Types of Decision Trees

    Binary Decision Trees

    Each node has exactly two branches (True / False)

    • For numeric features: "Is X ≤ threshold?" (e.g., "Pages Viewed ≤ 20?")
    • For categorical features: "Is X = value?" (e.g., "Referrer = Slashdot?")

    Maps directly to if-then-else statements in code — computationally efficient

    Non-Binary Decision Trees

    Categorical nodes can have multiple branches

    • Each branch corresponds to one possible value
    • Often visually simpler for features with many values
    • Requires switch-case logic — less efficient

    More intuitive when a feature has many categories (e.g., Country, Type)

    💬 Reflect

    What did we draw in Exercise 1? Binary or non-binary — and why? Where might we need a non-binary tree instead of a binary one?

    Learning Decision Trees

    From naive memorisation to generalisation — divide-and-conquer, entropy, information gain, Gini impurity, and numerical attributes.

    02

    What is Learning Again?

    Recap
    💬 Discussion

    The naive approach: construct a tree with one path to a leaf for each example — testing all attributes along the path and attaching the classification to the leaf. What is wrong with this?

    Answer
    • This correctly classifies all training examples — but doesn't generalise
    • It just memorises the observations
    • We need a tree that is: (1) consistent with the training set, and (2) as small as possible to promote generalisation

    Finding the smallest consistent tree is intractable — but decision tree learning algorithms use greedy heuristics to efficiently find a reasonably small tree.

    Divide-and-Conquer Strategy

    Decision tree learning adopts a greedy divide-and-conquer approach:

    1. Select the most important attribute to test at the root
    2. Divide the training set into subsets corresponding to each value of that attribute
    3. Recursively apply the same process to each subset
    Goal

    Reach the correct classification with a small number of tests, creating a tree that is both accurate and shallow.

    Key Question

    But how do we determine which attribute is most important? → This is where entropy and information gain come in.

    Entropy

    Entropy comes from information theory and measures the unpredictability of a random variable. When entropy is high, the classes are mixed. When entropy is zero, we have a pure node.

    Formal: for a random variable X with possible values V(X):

    H(X) = − Σ p(x) · log₂(p(x))
    • 0 ≤ H(X) ≤ log₂(|V(X)|)
    • H(X) = 0 means complete certainty (pure node)
    • Maximum value occurs when all outcomes are equally likely
    Maximum Entropy — Examples
    • Binary variable (yes/no): max H = log₂(2) = 1 bit — when P(yes) = P(no) = 0.5
    • 4 possible values (card suit): max H = log₂(4) = 2 bits — each at probability 0.25
    • 8 possible values: max H = log₂(8) = 3 bits

    Entropy: Bob's Skiing Example

    Example

    Bob decides whether to go skiing based on: Snow nearby, Weekend, Sunny.

    SnowWeekendSunGo Skiing
    yesyesyesyes
    yesyesyesyes
    yesyesnoyes
    yesnoyesyes
    noyesyesyes
    noyesyesyes
    noyesyesno
    noyesnono
    nonoyesno
    nonoyesno
    nononono
    Entropy of "Go Skiing"

    6 yes, 5 no out of 11 examples:

    H(D) = −(6/11 · log₂(6/11) + 5/11 · log₂(5/11)) = 0.994

    Close to 1 → high uncertainty. The maximum entropy for a binary variable is 1, indicating we're almost at maximum unpredictability.

    Information Gain

    Information gain measures how much the entropy decreases when we split the data based on a particular attribute.

    Formal: for a dataset D and attribute X:

    G(D, X) = H(D) − Σ (|D_x| / |D|) · H(D_x)
    • H(D) — entropy of the original dataset
    • D_x — subset of D where attribute X has value x
    • |D_x| — size of subset D_x
    • H(D_x) — entropy of subset D_x
    Decision Rule

    The attribute with the highest information gain is the most important attribute — select it as the split node.

    Information Gain: Bob's Skiing Example

    Example

    Bob decides whether to go skiing based on: Snow nearby, Weekend, Sunny.

    SnowWeekendSunGo Skiing
    yesyesyesyes
    yesyesyesyes
    yesyesnoyes
    yesnoyesyes
    noyesyesyes
    noyesyesyes
    noyesyesno
    noyesnono
    nonoyesno
    nonoyesno
    nononono
    Information Gain of "Go Skiing"

    Calculating information gain for the Snow nearby attribute:

    • Dyes: 4 examples, all Go Skiing = yes → entropy = 0 (pure node)
    • Dno: 7 examples, 2 yes and 5 no → entropy = 0.863
    G(D, Snow) = 0.994 − (4/11 · 0) − (7/11 · 0.863) = 0.445

    Similarly:

    • G(D, Weekend) = 0.150
    • G(D, Sun) = 0.049
    Result

    Since Snow nearby has the highest information gain (0.445), we select it as the root node.

    How to Build a Decision Tree

    After selecting Snow nearby as root, continue the process recursively for each subset:

    • Dyes — all examples have Go Skiing = yes → pure node, we're done
    • Dno — mixed yes and no → calculate information gain for remaining attributes on this subset:
      • Weekend has higher gain (0.292) than Sun (0.169) → split on Weekend
      • Continue recursively…
    Decision tree for Go Skiing? — Snow nearby as root, branching through Weekend and Sun

    Figure 3: Decision tree Example.

    Exercise 2 — Which Race Should We Sign Up For?

    Dataset
    Exercise
    You are training hard in your fitness class each week. A member of your team proposes to attend a race. The group can't agree. So you do what any data-driven athlete would do: collect data from 10 athletes who already chose — and build a decision tree to let the model decide. Calculate (1) entropy of the dataset, (2) information gain of each attribute, (3) identify the root node.
    #Strength LevelEjoys SufferingSocial ButterflyEvent
    1LowYesNoMarathon
    2MediumYesYesHyrox
    3LowNoNoRest
    4HighYesNoMarathon
    5MediumNoYesRest
    6LowNoNoRest
    7LowNoNoRest
    8MediumYesYesHyrox
    9HighYesYesHyrox
    10LowYesNoMarathon

    Step 1 — Calculate H(D)

    Exercise
    Recall
    H(D) = − Σ p(x) · log₂(p(x))
    ✏️ Task
    1. Count the examples per class. What is the probability of each?
    2. Calculate H(D).
    3. Is the entropy high or low relative to the maximum? What does it tell us?
    Solution

    Marathon: 3/10 · Hyrox: 3/10 · Rest: 4/10

    H(D) = −(3/10·log₂(3/10) + 3/10·log₂(3/10) + 4/10·log₂(4/10))
    = −(3/10·(−1.737) + 3/10·(−1.737) + 4/10·(−1.322))
    = −(−0.521 − 0.521 − 0.529) = 1.571

    Maximum for 3 classes = log₂(3) = 1.585 — we are very close to maximum uncertainty. The dataset is highly mixed.

    Step 2 — Calculate Information Gain

    Exercise
    Recall
    H(D) = − Σ p(x) · log₂(p(x))

    G(D, X) = H(D) − Σ (|D_x| / |D|) · H(D_x)
    ✏️ Task — repeat for each of the 3 attributes
    1. Split D by the attribute — how many Ma / Hy / Re fall into each subset?
    2. For each subset
      2.1 calculate entropy
      2.2 calculate information gain G(D,X)
    3. Note the result — compare all three G values at the end to identify the root node

    Step 2 — G(D, Enjoys Suffering)

    Exercise

    Split D by Enjoys Suffering (Yes / No):

    Subset#MaHyRe
    Yes6330
    No4004
    Solution

    H(DYes) = −(3/6·log₂(3/6) + 3/6·log₂(3/6))
    = −(3/6·(−1.000) + 3/6·(−1.000))
    = −(−0.500 − 0.500) = 1.000

    H(DNo) = −(4/4·log₂(4/4)) = −(1·0) = 0.000 (pure — all Rest)

    G(D, Suffering) = 1.571 − (6/10)·1.000 − (4/10)·0.000
    = 1.571 − 0.600 − 0.000 = 0.971

    Step 2 — G(D, Strength Level)

    Exercise

    Split D by Strength Level (High / Medium / Low):

    Subset#MaHyRe
    High2110
    Medium3021
    Low5203
    Solution

    H(DHigh) = −(1/2·log₂(1/2) + 1/2·log₂(1/2))
    = −(1/2·(−1.000) + 1/2·(−1.000))
    = −(−0.500 − 0.500) = 1.000

    H(DMedium) = −(2/3·log₂(2/3) + 1/3·log₂(1/3))
    = −(2/3·(−0.585) + 1/3·(−1.585))
    = −(−0.390 − 0.528) = 0.918

    H(DLow) = −(2/5·log₂(2/5) + 3/5·log₂(3/5))
    = −(2/5·(−1.322) + 3/5·(−0.737))
    = −(−0.529 − 0.442) = 0.971

    G(D, Strength) = 1.571 − (2/10)·1.000 − (3/10)·0.918 − (5/10)·0.971
    = 1.571 − 0.200 − 0.275 − 0.486 = 0.610

    Step 2 — G(D, Social Butterfly)

    Exercise

    Split D by Social Butterfly (Yes / No):

    Subset#MaHyRe
    Yes4031
    No6303
    Solution

    H(DYes) = −(3/4·log₂(3/4) + 1/4·log₂(1/4))
    = −(3/4·(−0.415) + 1/4·(−2.000))
    = −(−0.311 − 0.500) = 0.811

    H(DNo) = −(3/6·log₂(3/6) + 3/6·log₂(3/6))
    = −(3/6·(−1.000) + 3/6·(−1.000))
    = −(−0.500 − 0.500) = 1.000

    G(D, Social) = 1.571 − (4/10)·0.811 − (6/10)·1.000
    = 1.571 − 0.324 − 0.600 = 0.647

    Step 2 — Which Attribute is the Root?

    Solution
    AttributeG(D, X)
    Enjoys Suffering 0.971 → highest → root
    Social Butterfly0.647
    Strength Level0.610
    Result

    Enjoys Suffering has the highest information gain (0.971) → it is the root node.
    Splitting immediately produces one pure leaf: Suffering = No → all 4 athletes chose Rest.

    Step 3 — Recurse: DSuffering=Yes

    Additional Exercise

    DYes has 6 examples: Marathon×3, Hyrox×3, Rest×0. H(DYes) = 1.000. Calculate G for the remaining 2 attributes.

    ✏️ Task
    1. Split DYes by Social Butterfly — count Ma/Hy per subset, calculate H, calculate G
    2. Split DYes by Strength Level — same process
    3. Which attribute should split DYes?
    Solution

    G(DYes, Social):

    Social=Yes (3): Ma=0, Hy=3 → H = 0.000 (pure)
    Social=No (3): Ma=3, Hy=0 → H = 0.000 (pure)

    = 1.000 − (3/6)·0.000 − (3/6)·0.000
    = 1.000 − 0 − 0 = 1.000

    G(DYes, Strength):

    High (2): Ma=1, Hy=1 → H = 1.000
    Medium (2): Ma=0, Hy=2 → H = 0.000 (pure)
    Low (2): Ma=2, Hy=0 → H = 0.000 (pure)

    = 1.000 − (2/6)·1.000 − 0 − 0
    = 1.000 − 0.333 = 0.667

    Result

    Social Butterfly has the highest G (1.000) → perfect split.
    Social=Yes → pure: Hyrox (3/3) · Social=No → pure: Marathon (3/3)

    Solution — The Complete Tree

    Solution
    Decision tree: Enjoys Suffering root, No → Rest, Yes → Social Butterfly, No → Marathon, Yes → Hyrox
    Key takeaways
    • Suffering is root — highest G (0.971), immediately isolates all Rest cases
    • Social perfectly separates Marathon (No) from Hyrox (Yes) — G = 1.000, maximum possible
    • Strength was never needed — the tree is complete without it
    • All 3 leaf nodes are pure · 2 internal nodes · depth 2

    Gini Impurity

    Instead of information gain, Gini impurity is another common splitting criterion — it measures the probability of incorrectly classifying a randomly chosen element if it were randomly labeled according to the class distribution:

    Gini(X) = Σ p(x) · (1 − p(x))
    Properties
    • 0 = all elements belong to the same class (pure)
    • Maximum when classes are equally likely
    • Lower value = "purer" node
    • Often used in CART (Classification and Regression Trees) algorithm
    Entropy vs. Gini
    • Entropy measures uncertainty
    • Gini measures expected error rate
    • In practice both often yield similar trees
    • Gini is computationally more efficient — no logarithms

    Exercise 3 — Decision Tree with Gini Impurity

    Exercise
    Recall — Gini Impurity
    Gini(D) = 1 − Σ p(x)²

    Measures the probability of incorrectly classifying a randomly chosen element. A pure node has Gini = 0. Instead of information gain, we now look for the attribute with the lowest weighted Gini impurity after splitting.

    ✏️ Task — repeat for each of the 3 attributes
    1. Split D by the attribute — how many Ma / Hy / Re fall into each subset?
    2. For each subset, calculate Gini(Dx) = 1 − Σ p(x)²
    3. Calculate the weighted Gini impurity: Σ (|Dx| / |D|) · Gini(Dx)
    4. Note the result — compare all three values at the end to identify the root node

    Step 1 — Calculate Gini(D)

    Exercise
    Recall
    Gini(D) = 1 − Σ p(x)²
    ✏️ Task
    1. Count the examples per class. What is the probability of each?
    2. Calculate Gini(D).
    3. Is the Gini high or low relative to the maximum? What does it tell us?
    Solution

    Marathon: 3/10 · Hyrox: 3/10 · Rest: 4/10

    Gini(D) = 1 − (3/10)² − (3/10)² − (4/10)²
    = 1 − 0.090 − 0.090 − 0.160 = 0.660

    Maximum for 3 equal classes = 1 − 3·(1/3)² = 0.667 — we are very close to maximum impurity. The dataset is highly mixed.

    Step 2 — Weighted Gini for Enjoys Suffering

    Exercise

    Split D by Enjoys Suffering (Yes / No):

    SubsetnMaHyRe
    Yes6330
    No4004
    ✏️ Task
    1. Calculate Gini(DYes) and Gini(DNo)
    2. Calculate the weighted Gini = (6/10)·Gini(DYes) + (4/10)·Gini(DNo)
    Solution

    Gini(DYes) = 1 − (3/6)² − (3/6)²
    = 1 − 0.250 − 0.250 = 0.500

    Gini(DNo) = 1 − (4/4)²
    = 1 − 1.000 = 0.000 (pure — all Rest)

    Weighted Gini(Suffering) = (6/10)·0.500 + (4/10)·0.000
    = 0.300 + 0.000 = 0.300

    Step 2 — Weighted Gini for Social Butterfly

    Exercise

    Split D by Social Butterfly (Yes / No):

    SubsetnMaHyRe
    Yes4031
    No6303
    Solution

    Gini(DYes) = 1 − (3/4)² − (1/4)²
    = 1 − 0.5625 − 0.0625 = 0.375

    Gini(DNo) = 1 − (3/6)² − (3/6)²
    = 1 − 0.250 − 0.250 = 0.500

    Weighted Gini(Social) = (4/10)·0.375 + (6/10)·0.500
    = 0.150 + 0.300 = 0.450

    Step 2 — Weighted Gini for Strength Level

    Exercise

    Split D by Strength Level (High / Medium / Low):

    SubsetnMaHyRe
    High2110
    Medium3021
    Low5203
    Solution

    Gini(DHigh) = 1 − (1/2)² − (1/2)²
    = 1 − 0.250 − 0.250 = 0.500

    Gini(DMedium) = 1 − (2/3)² − (1/3)²
    = 1 − 0.444 − 0.111 = 0.444

    Gini(DLow) = 1 − (2/5)² − (3/5)²
    = 1 − 0.160 − 0.360 = 0.480

    Weighted Gini(Strength) = (2/10)·0.500 + (3/10)·0.444 + (5/10)·0.480
    = 0.100 + 0.133 + 0.240 = 0.473

    Step 3 — Which Attribute is the Root?

    Solution
    AttributeWeighted Gini impurityG (Info Gain)
    Enjoys Suffering 0.300 → lowest → root 0.971 → highest
    Social Butterfly0.4500.647
    Strength Level0.4730.610
    Result

    Enjoys Suffering has the lowest weighted Gini impurity (0.300) → it is the root node. This matches the Information Gain result from Exercise 2.

    Reflect

    Gini and Information Gain agree here — both rank the attributes in the same order. In practice they usually agree, but can differ when one attribute has many values or when class distributions are skewed. Gini is computationally cheaper — no logarithms required.

    Numerical Attributes

    For numerical attributes, rather than treating each value as a separate category, we find the optimal binary split point:

    1. Sort the values of the attribute in the training set
    2. Consider all possible thresholds between adjacent values
    3. Calculate information gain (or Gini) for each threshold
    4. Select the threshold with the highest information gain
    Key Insight

    The same numerical attribute can be used multiple times in a decision tree with different thresholds.

    Numerical Attributes: Snow Depth Example

    Example

    Replacing Snow nearby with Snow Depth (cm) as a numerical attribute:

    Sorted values: 0, 0, 1, 2, 3, 5, 8, 15, 25, 30, 35

    Testing threshold: Snow Depth ≤ 10 cm

    • ≤ 10 cm: 7 examples (2 yes, 5 no)
    • > 10 cm: 4 examples (4 yes, 0 no)
    H(D≤10) = 0.863
    H(D>10) = 0 (pure)
    G(D, Depth≤10) = 0.994 − 7/11·0.863 − 4/11·0 = 0.445
    Decision tree: Snow Depth ≤ 10cm root, No → Go Skiing Yes, Yes → Weekend → Sun
    Result

    Same information gain (0.445) as Snow nearby — equally good root node choice.

    Exercise 4 — Numeric Attributes: Which Race?

    Dataset
    Extension

    We add another competition "CrossFit" which results in mixed leaf nodes for Enjoys Suffering = Yes and Social Butterfly = Yes.
    Two numerical attributes are now added: weekly training volume (km) and running pace (min/km). Your task: find the best threshold for each and build the full tree.

    #SufferingSocialWeekly kmPace (min/km)Event
    1YesNo654:30Marathon
    2YesYes425:30Hyrox
    3NoNo07:00Rest
    4YesNo704:18Marathon
    5NoYes206:45Rest
    6NoNo107:15Rest
    7NoNo157:30Rest
    8YesYes385:45Hyrox
    9YesYes56:12CrossFit
    10YesNo604:45Marathon
    11YesYes226:30CrossFit
    12YesYes455:18Hyrox

    Exercise 4 — What We Already Know

    Recap

    The categorical splits are already solved. All relevant values are provided — your task is only to find the best numeric threshold.

    Partial decision tree: Suffering root, No → Rest, Yes → Social, No → Marathon, Yes → numeric split to be found
    ✏️ Task
    1. Sort the Weekly km values. What thresholds would you test?
    2. Test km ≤ threshold — calculate H for each subset, then G
    3. Test pace ≤ 5:57 — same process
    4. Which threshold gives the highest G? Are both leaves pure?

    Exercise 4 — Step 1: Sort Values and Identify Thresholds

    Solution

    The Social=Yes branch has 5 athletes. Sort both numerical attributes:

    Weekly km — sorted

    5 · 22 · 38 · 42 · 45

    Thresholds to test (midpoints between adjacent values):

    14 · 30 · 40 · 44

    Recall

    For numerical attributes we test all midpoints between adjacent sorted values — the threshold with the highest information gain is selected.

    Because 30 sits exactly between the CrossFit athletes (km = 5 and 22) and the Hyrox athletes (km = 38, 42, 45). Everything ≤ 30 is CrossFit, everything > 30 is Hyrox — a perfect natural gap in the data. You wouldn't know this upfront. You'd test all four thresholds (14, 30, 40, 44), calculate G for each, and 30 would come out highest — which is what Step 2 shows.

    Exercise 4 — Step 2: G(Dsoc=yes, km ≤ 30)

    Solution

    Split the Social=Yes branch by Weekly km ≤ 30:

    SubsetAthletesnHyCF
    km ≤ 30#9 (5 km), #11 (22 km)202
    km > 30#8 (38 km), #2 (42 km), #12 (45 km)330
    Calculation

    H(Dkm≤30) = −(2/2·log₂(2/2)) = −(1·0) = 0.000 (pure — all CrossFit)

    H(Dkm>30) = −(3/3·log₂(3/3)) = −(1·0) = 0.000 (pure — all Hyrox)

    G(Dsoc=yes, km≤30) = 0.971 − (2/5)·0.000 − (3/5)·0.000
    = 0.971 − 0 − 0 = 0.971 — perfect split

    Exercise 4 — Step 3: G(Dsoc=yes, pace ≤ 5:57)

    Solution

    Split the Social=Yes branch by Pace ≤ 5:57:

    SubsetAthletesnHyCF
    pace ≤ 5:57#12 (5:18), #2 (5:30), #8 (5:45)330
    pace > 5:57#9 (6:12), #11 (6:30)202
    Calculation

    H(Dpace≤5:57) = −(3/3·log₂(3/3)) = 0.000 (pure — all Hyrox)

    H(Dpace>5:57) = −(2/2·log₂(2/2)) = 0.000 (pure — all CrossFit)

    G(Dsoc=yes, pace≤5:57) = 0.971 − (3/5)·0.000 − (2/5)·0.000
    = 0.971 − 0 − 0 = 0.971 — equally perfect split

    Exercise 4 — Step 4: Which Threshold to Choose?

    Solution
    ThresholdGBoth leaves pure?
    Weekly km ≤ 300.971Yes — CrossFit (2/2) · Hyrox (3/3)
    Pace ≤ 5:570.971Yes — Hyrox (3/3) · CrossFit (2/2)
    Result

    Both thresholds give G = 0.971 — a perfect split. This is not a coincidence: CrossFit athletes in this dataset consistently train fewer km at slower pace, while Hyrox athletes train more km at faster pace. The two attributes are perfectly correlated in this subset. Either split is valid — use Weekly km ≤ 30 for the tree.

    Pruning note

    The CrossFit leaf is based on only 2 examples. This makes it a prime candidate for pruning — we will revisit this in Exercise 5.

    Exercise 4 — The Complete Tree

    Solution
    Complete decision tree: Suffering → Social → Weekly km ≤ 30 → CrossFit / Hyrox
    Key takeaways
    • Suffering — root, isolates Rest immediately
    • Social=No → Marathon pure — marathoners train alone
    • Weekly km ≤ 30 separates CrossFit (low volume) from Hyrox (higher volume)
    • Pace ≤ 5:57 is an equally valid alternative split — same result, different attribute

    3.2 — Pruning: Recap

    Our tree perfectly classifies all 12 training examples — but is it too specific?

    The Problem: Overfitting
    • A deeper tree fits training data more perfectly
    • But it may learn noise, not patterns
    • Result: poor generalisation to new athletes
    • Our CrossFit leaf is based on only 2 examples — how confident are we?
    The Solution: Pruning
    • Pre-pruning: stop growing early — set minimum examples per node, maximum depth, or minimum IG threshold
    • Post-pruning: build the full tree, then remove nodes that do not improve performance on a validation set

    Post-pruning is more robust — it uses real evidence to decide what to remove.

    Reduced Error Pruning

    Replace an internal node with its most frequent class label. If accuracy on the validation set does not decrease — keep the pruned version. Simpler tree, same or better generalisation.

    Exercise 5 — Pruning

    Exercise

    Consider the complete tree from Exercise 4. A validation set of 4 new athletes has arrived:

    #SufferingSocialWeekly kmPaceEvent (actual)
    V1YesYes286:05CrossFit
    V2YesYes405:40Hyrox
    V3YesNo554:55Marathon
    V4NoNo87:10Rest
    ✏️ Task
    1. Run all 4 validation athletes through the tree. What does the tree predict for each? How many are correctly classified?
    2. Apply reduced error pruning to the Weekly km ≤ 30 node — replace it with the majority class of its training examples (Hyrox×3, CrossFit×2). What is the pruned prediction for V1 and V2?
    3. Does pruning increase or decrease accuracy on the validation set? Should we keep the pruned version?
    4. Would pre-pruning have prevented this node from being created? What minimum examples threshold would you set?

    Exercise 5 — Step 1: Run Validation Through the Tree

    Solution
    #Path through treePredictedActualCorrect?
    V1 Suffering=Yes → Social=Yes → km=28 ≤ 30 CrossFit CrossFit
    V2 Suffering=Yes → Social=Yes → km=40 > 30 Hyrox Hyrox
    V3 Suffering=Yes → Social=No Marathon Marathon
    V4 Suffering=No Rest Rest
    Result

    The original tree correctly classifies 4/4 validation examples — perfect accuracy on the validation set.

    Exercise 5 — Step 2: Apply Reduced Error Pruning

    Solution

    We consider pruning the Weekly km ≤ 30 node. The rule for reduced error pruning:

    Recall — Reduced Error Pruning

    Replace the node with its majority class from the training examples that reach it. Then check: does accuracy on the validation set decrease? If not — keep the pruned version.

    🔎 Hint
    1. Which training examples reach the km ≤ 30 node? List them.
    2. Count the classes — what is the majority class?
    3. Replace the node with that class label — draw the pruned tree.
    Solution

    Training examples reaching km ≤ 30: athletes who have Suffering=Yes AND Social=Yes.
    That is athletes #2, #8, #9, #11, #12 → Hyrox×3 (km: 38, 42, 45) · CrossFit×2 (km: 5, 22)

    Majority class: Hyrox (3 out of 5)

    Pruned tree: replace km ≤ 30 node with leaf Hyrox

    Exercise 5 — Step 3: Run Validation Through the Pruned Tree

    Solution

    Pruned tree: Suffering → Social → Hyrox (leaf, no more km split).

    🔎 Hint

    For each validation athlete, follow the pruned tree and write down the prediction. Compare to actual.

    #SufferingSocialkmPruned predictionActualCorrect?
    V1YesYes28 Hyrox (leaf — km ignored) CrossFit
    V2YesYes40 Hyrox Hyrox
    V3YesNo55 Marathon Marathon
    V4NoNo8 Rest Rest
    Result

    Pruned accuracy: 3/4 (75%) — V1 is now misclassified.

    Exercise 5 — Step 4: Keep or Discard the Pruned Version?

    Solution
    🔎 Hint

    Compare original vs pruned accuracy. Apply the reduced error pruning decision rule.

    VersionV1V2V3V4Accuracy
    Original tree 4/4
    Pruned tree 3/4
    Decision

    Pruned accuracy (3/4) < original accuracy (4/4) → accuracy decreased.

    → By the reduced error pruning rule: keep the original tree.

    The validation data confirms the km ≤ 30 split is genuinely useful — even though it was built on only 2 training examples. The CrossFit pattern holds in the validation set too.

    Exercise 5 — Step 5: Pre-Pruning

    Solution
    ✏️ Task

    Would pre-pruning have prevented the km ≤ 30 node from being created? What minimum examples threshold would you set?

    Solution

    Which examples reach the km ≤ 30 node?
    5 athletes (Suffering=Yes, Social=Yes). After the km split:
    — km ≤ 30: 2 examples (CrossFit)
    — km > 30: 3 examples (Hyrox)

    With min_samples = 3: the km ≤ 30 split would be blocked — the CrossFit subset has only 2 examples, below the threshold. The node would become a leaf: Hyrox (majority of 5).

    Would that have been correct?
    No — post-pruning showed the split was valid. Pre-pruning with min_samples=3 would have removed a genuinely useful node.

    Key Insight

    Pre-pruning is faster but may stop too early — it cannot use validation evidence. Post-pruning waits for real data before deciding. With small datasets like this one, post-pruning is the safer choice.

    Recursive Learning Process

    In each recursive step of decision tree learning, there are four cases to consider:

    1. Mixed examples (positive and negative): Choose the best attribute and split
    2. Pure node (all examples have same class): Create a leaf node with that class
    3. No examples: Create a leaf with the majority class from the parent node
    4. No attributes left but mixed classes: Create a leaf with the majority class (handles noisy data)
    Note

    Cases 3 and 4 are important for handling edge cases with small or noisy datasets. The algorithm is robust because it always has a way to proceed regardless of the data configuration.

    Questions?

    What remains unclear?

    ?